Methods and compositions for analyzing nucleic acid

ABSTRACT

The technology relates in part to methods and compositions for analyzing nucleic acid. In some aspects, the technology relates to methods and compositions for preparing a nucleic acid library. In some aspects, the technology relates to methods and compositions for analyzing ends of nucleic acid fragments.

RELATED PATENT APPLICATIONS

This patent application is a continuation of U.S. patent application Ser. No. 16/961,113, filed on Jul. 9, 2020, entitled METHODS AND COMPOSITIONS FOR ANALYZING NUCLEIC ACID, naming Kelly M. HARKINS KINCAID et al. as inventors, and designated by attorney docket no. CBS-2001 US, which is a 35 U.S.C. 371 national phase application of International Patent Cooperation Treaty (PCT) Application No. PCT/US2019/013210, filed on Jan. 11, 2019, entitled METHODS AND COMPOSITIONS FOR ANALYZING NUCLEIC ACID, naming Kelly M. HARKINS KINCAID et al. as inventors, and designated by attorney docket no. CBS-2001-PC. International PCT Application No. PCT/US2019/013210 claims the benefit of U.S. provisional patent application No. 62/617,055 filed on Jan. 12, 2018, entitled METHODS AND COMPOSITIONS FOR ANALYZING NUCLEIC ACID, naming Kelly M. HARKINS KINCAID et al. as inventors, and designated by attorney docket no. CBS-2001-PV. International PCT Application No. PCT/US2019/013210 also claims the benefit of U.S. provisional patent application No. 62/618,382 filed on Jan. 17, 2018, entitled METHODS AND COMPOSITIONS FOR ANALYZING NUCLEIC ACID, naming Kelly M. HARKINS KINCAID et al. as inventors, and designated by attorney docket no. CBS-2001-PV2. International PCT Application No. PCT/US2019/013210 also claims the benefit of U.S. provisional patent application No. 62/769,787 filed on Nov. 20, 2018, entitled METHODS AND COMPOSITIONS FOR ANALYZING NUCLEIC ACID, naming Kelly M. HARKINS KINCAID et al. as inventors, and designated by attorney docket no. CBS-2001-PV3. The entire content of the foregoing applications is incorporated herein by reference, including all text, tables and drawings.

STATEMENT OF GOVERNMENTAL SUPPORT

This invention was made with government support under contract 1 R43CA232935-01 awarded by the National Institutes of Health. The government has certain rights in this invention.

SEQUENCE LISTING

The instant application contains a Sequence Listing which has been submitted electronically in XML format and is hereby incorporated by reference in its entirety. Said XML copy, created on Aug. 2, 2023, is named CBS-2001CON1_SL.xml and is 85,411 bytes in size.

FIELD

The technology relates in part to methods and compositions for analyzing nucleic acid. In some aspects, the technology relates to methods and compositions for preparing a nucleic acid library. In some aspects, the technology relates to methods and compositions for analyzing ends of nucleic acid fragments.

BACKGROUND

Genetic information of living organisms (e.g., animals, plants and microorganisms) and other forms of replicating genetic information (e.g., viruses) is encoded in nucleic acid (i.e., deoxyribonucleic acid (DNA) or ribonucleic acid (RNA)). Genetic information is a succession of nucleotides or modified nucleotides representing the primary structure of chemical or hypothetical nucleic acids.

A variety of high-throughput sequencing platforms are used for analyzing nucleic acid. The Illumina platform, for example, involves clonal amplification of adaptor-ligated DNA fragments. Another platform is nanopore-based sequencing, which relies on the transition of nucleic acid molecules or individual nucleotides through a small channel. Library preparation for certain sequencing platforms often includes fragmentation of DNA, modification of fragment ends, and ligation of adapters, and may include amplification of nucleic acid fragments (e.g., PCR amplification).

The selection of an appropriate sequencing platform for particular types of nucleic acid analysis requires a detailed understanding of the technologies available, including sources of error, error rate, as well as the speed and cost of sequencing. While sequencing costs have decreased, the throughput and costs of library preparation can be a limiting factor. One aspect of library preparation includes modification of the ends of nucleic acid fragments such that they are suitable for a particular sequencing platform. Nucleic acid ends may contain useful information. Accordingly, methods that modify nucleic acid ends (e.g., for library preparation) while preserving the information contained in the nucleic acid ends would be useful for processing and analyzing nucleic acid.

SUMMARY

Provided in some aspects are methods for producing a nucleic acid library, comprising combining a nucleic acid composition comprising target nucleic acids and a plurality of oligonucleotide species, where a) some or all of the target nucleic acids comprise an overhang; b) some or all of the oligonucleotides in the plurality of oligonucleotide species comprise two strands, and an overhang at a first end and two non-complementary strands at a second end; where the overhang is capable of hybridizing to a target nucleic acid overhang, where each oligonucleotide species has a unique overhang sequence and length; c) each oligonucleotide in the plurality of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the oligonucleotide overhang; and d) the nucleic acid composition and the plurality of oligonucleotide species are combined under conditions in which oligonucleotide overhangs hybridize to target nucleic acid overhangs having a corresponding length, thereby forming hybridization products.

Provided in some aspects are methods for producing a nucleic acid library, comprising a) combining a nucleic acid composition comprising target nucleic acids and a plurality of oligonucleotide species, where i) each oligonucleotide in the plurality of oligonucleotide species comprises one strand capable of forming a hairpin structure having a single-stranded loop, where the loop comprises one or more ribonucleic acid (RNA) nucleotides, ii) some or all of the target nucleic acids comprise an overhang, iii) some or all of the oligonucleotides in the plurality of oligonucleotide species comprise an overhang capable of hybridizing to a target nucleic acid overhang, where each oligonucleotide species has a unique overhang sequence and length, iv) each oligonucleotide in the plurality of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the oligonucleotide overhang, and v) the nucleic acid composition and the plurality of oligonucleotide species are combined under conditions in which oligonucleotide overhangs hybridize to target nucleic acid overhangs having a corresponding length, thereby forming hybridization products; and b) contacting the hybridization products under cleavage conditions with one or more cleavage agents capable of cleaving the hybridization products within the hairpin loop at the RNA nucleotide(s), thereby forming cleaved hybridization products.

Also provided in some aspects are compositions comprising a plurality of oligonucleotide species, where a) each oligonucleotide in the plurality of oligonucleotide species comprises one strand capable of forming a hairpin structure having a single-stranded loop, where the loop comprises one or more ribonucleic acid (RNA) nucleotides; b) some or all of the oligonucleotides in the plurality of oligonucleotide species comprise an overhang capable of hybridizing to an overhang in a target nucleic acid, where each oligonucleotide species has a unique overhang sequence and length; and c) each oligonucleotide in the plurality of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the oligonucleotide overhang.

Provided in some aspects are methods for modifying nucleic acid ends, comprising a) combining a nucleic acid composition comprising target nucleic acids and a plurality of oligonucleotide species, where i) each oligonucleotide in the plurality of oligonucleotide species comprises one or more cleavage sites capable of being cleaved under cleavage conditions, ii) some or all of the target nucleic acids comprise an overhang, iii) some or all of the oligonucleotides in the plurality of oligonucleotide species comprise two strands and a first overhang and a second overhang, where each overhang is capable of hybridizing to a target nucleic acid overhang, where each oligonucleotide species has a unique overhang sequence and length, iv) each oligonucleotide in the plurality of oligonucleotide species comprises at least two oligonucleotide overhang identification sequences specific to one or more features of the first and second oligonucleotide overhangs, and v) the nucleic acid composition and the plurality of oligonucleotide species are combined under conditions in which oligonucleotide overhangs hybridize to target nucleic acid overhangs having a corresponding length, thereby forming hybridization products; b) contacting the hybridization products under cleavage conditions with one or more cleavage agents capable of cleaving the hybridization products at the one or more cleavage sites, thereby forming cleaved hybridization products; and c) contacting the cleaved hybridization products with a strand-displacing polymerase, thereby forming blunt-ended nucleic acid fragments.

Also provided in some aspects are compositions comprising a plurality of oligonucleotide species, where a) each oligonucleotide in the plurality of oligonucleotide species comprises one or more cleavage sites capable of being cleaved under cleavage conditions; b) some or all of the oligonucleotides in the plurality of oligonucleotide species comprise two strands and a first overhang and a second overhang, where each overhang is capable of hybridizing to a target nucleic acid overhang, where each oligonucleotide species has a unique overhang sequence and length; and c) each oligonucleotide in the plurality of oligonucleotide species comprises at least two oligonucleotide overhang identification sequences specific to one or more features of the first and second oligonucleotide overhangs.

Provided in some aspects are methods for modifying nucleic acid ends, comprising a) combining a nucleic acid composition comprising target nucleic acids and a plurality of oligonucleotide species, where i) some or all of the oligonucleotides in the plurality of oligonucleotide species comprise two strands and an overhang at a first end and one or more modified nucleotides at a second end, where the overhang is capable of hybridizing to a target nucleic acid overhang, where each oligonucleotide species has a unique overhang sequence and length, ii) some or all of the target nucleic acids comprise an overhang, iii) each oligonucleotide in the plurality of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the oligonucleotide overhang, and iv) the nucleic acid composition and the plurality of oligonucleotide species are combined under conditions in which oligonucleotide overhangs hybridize to target nucleic acid overhangs having a corresponding length, thereby forming hybridization products; and b) contacting the hybridization products with a strand-displacing polymerase, thereby forming blunt-ended nucleic acid fragments.

Also provided in some aspects are compositions comprising a plurality of oligonucleotide species, where a) some or all of the oligonucleotides in the plurality of oligonucleotide species comprise two strands and an overhang at a first end and one or more modified nucleotides at a second end, where the overhang is capable of hybridizing to a target nucleic acid overhang, where each oligonucleotide species has a unique overhang sequence and length; and b) each oligonucleotide in the plurality of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the oligonucleotide overhang.

Provided in some aspects are methods for modifying nucleic acid ends, comprising a) combining a nucleic acid composition comprising target nucleic acids and a plurality of oligonucleotide species, where i) the oligonucleotides in the plurality of oligonucleotide species comprise two strands and an overhang at a first end, where the first end overhang comprises a palindromic sequence; ii) some or all of the oligonucleotides in the plurality of oligonucleotide species comprise an overhang at a second end, where the second end overhang is capable of hybridizing to a target nucleic acid overhang, where each oligonucleotide species has a unique second end overhang sequence and length, iii) some or all of the target nucleic acids comprise an overhang, iv) each oligonucleotide in the plurality of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the second end overhang, v) each oligonucleotide in the plurality of oligonucleotide species comprises one or more modified nucleotides, and vi) the nucleic acid composition and the plurality of oligonucleotide species are combined under conditions in which first end overhangs hybridize to other first end overhangs and second end overhangs hybridize to target nucleic acid overhangs having a corresponding length, thereby forming circular hybridization products; b) contacting the hybridization products with an exonuclease, thereby generating exonuclease-treated hybridization products; c) shearing the exonuclease-treated hybridization products, thereby generating sheared exonuclease-treated hybridization products; and d) separating fragments comprising a sequence in the oligonucleotide species from fragments not comprising a sequence in the oligonucleotide species, thereby generating separated, sheared, exonuclease-treated hybridization products.

Also provided in some aspects are compositions comprising a plurality of oligonucleotide species, where a) the oligonucleotides in the plurality of oligonucleotide species comprise two strands and an overhang at a first end, where the first end overhang comprises a palindromic sequence; b) some or all of the oligonucleotides in the plurality of oligonucleotide species comprise an overhang at a second end, where the second end overhang is capable of hybridizing to a target nucleic acid overhang, where each oligonucleotide species has a unique second end overhang sequence and length; c) each oligonucleotide in the plurality of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the second end overhang; and d) each oligonucleotide in the plurality of oligonucleotide species comprises one or more modified nucleotides.

Provided in some aspects are methods for modifying nucleic acid ends, comprising a) combining a nucleic acid composition comprising target nucleic acids and a plurality of oligonucleotide species, where i) some or all of the oligonucleotides in the plurality of oligonucleotide species comprise (1) two strands and an overhang at a first end and two non-complementary strands at a second end, or (2) one strand capable of forming a hairpin structure having a single-stranded loop and an overhang; where the overhang is capable of hybridizing to a target nucleic acid overhang, where each oligonucleotide species has a unique overhang sequence and length, ii) some or all of the target nucleic acids comprise an overhang, iii) each oligonucleotide in the plurality of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the oligonucleotide overhang, and iv) the nucleic acid composition and the plurality of oligonucleotide species are combined under conditions in which oligonucleotide overhangs hybridize to target nucleic acid overhangs having a corresponding length, thereby forming hybridization products; and b) contacting the hybridization products with a strand-displacing polymerase, thereby forming blunt-ended nucleic acid fragments.

Also provided in some aspects are compositions comprising a plurality of oligonucleotide species, where a) some or all of the oligonucleotides in the plurality of oligonucleotide species comprise i) two strands and an overhang at a first end and two non-complementary strands at a second end, or ii) one strand capable of forming a hairpin structure having a single-stranded loop and an overhang; where the overhang is capable of hybridizing to a target nucleic acid overhang, where each oligonucleotide species has a unique overhang sequence and length; and b) each oligonucleotide in the plurality of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the oligonucleotide overhang.

Provided in some aspects are methods for modifying nucleic acid ends, comprising combining a nucleic acid composition comprising target nucleic acids and a plurality of oligonucleotide species, where a) some or all of the oligonucleotides in the plurality of oligonucleotide species comprise at least one overhang comprising RNA nucleotides, where the overhang is capable of hybridizing to a target nucleic acid overhang, where each oligonucleotide species has a unique overhang sequence and length, b) some or all of the target nucleic acids comprise an overhang, c) each oligonucleotide in the plurality of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the oligonucleotide overhang, and d) the nucleic acid composition and the plurality of oligonucleotide species are combined under conditions in which oligonucleotide overhangs hybridize to target nucleic acid overhangs having a corresponding length, thereby forming hybridization products.

Also provided in some aspects are compositions comprising a plurality of oligonucleotide species, where a) some or all of the oligonucleotides in the plurality of oligonucleotide species comprise at least one overhang comprising RNA nucleotides, where the overhang is capable of hybridizing to a target nucleic acid overhang, where each oligonucleotide species has a unique overhang sequence and length; and b) each oligonucleotide in the plurality of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the oligonucleotide overhang.

Provided in some aspects are methods for producing a nucleic acid library, comprising a) combining a nucleic acid composition comprising target nucleic acids and a first pool of oligonucleotide species, where i) some or all of the target nucleic acids comprise an overhang, ii) some or all of the oligonucleotides in the first pool of oligonucleotide species comprise an overhang capable of hybridizing to a target nucleic acid overhang, where each oligonucleotide species has a unique overhang sequence and length, iii) each oligonucleotide in the first pool of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the oligonucleotide overhang, iv) each oligonucleotide in the first pool of oligonucleotide species comprises a first primer binding domain, and v) the nucleic acid composition and the first pool of oligonucleotide species are combined under conditions in which oligonucleotide overhangs hybridize to target nucleic acid overhangs having a corresponding length, thereby forming a first set of combined products; b) cleaving the first set of combined products, thereby forming cleaved products; and c) combining the cleaved products and a second pool of oligonucleotide species, where i) each oligonucleotide in the second pool of oligonucleotide species comprises a first end and a second end, ii) each oligonucleotide in the second pool of oligonucleotide species comprises a second primer binding domain, where the first primer binding domain and the second primer binding domain are different, and iii) the cleaved products and the second pool of oligonucleotide species are combined under conditions in which the oligonucleotides in the second pool of oligonucleotide species attach at the first end to at least one end of the cleaved products, thereby forming a second set of combined products.

Also provided in some aspects are compositions comprising a) a first pool of oligonucleotide species, where i) some or all of the oligonucleotides in the first pool of oligonucleotide species comprise an overhang capable of hybridizing to a target nucleic acid overhang, where each oligonucleotide species has a unique overhang sequence and length, ii) each oligonucleotide in the first pool of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the oligonucleotide overhang, and iii) each oligonucleotide in the first pool of oligonucleotide species comprises a first primer binding domain; and b) a second pool of oligonucleotide species, where i) each oligonucleotide in the second pool of oligonucleotide species comprises a first end and a second end, and ii) each oligonucleotide in the second pool of oligonucleotide species comprises a second primer binding domain, where the first primer binding domain and the second primer binding domain are different.

Provided in some aspects are methods for producing a nucleic acid library, comprising a) combining a nucleic acid composition comprising target nucleic acids and a first pool of oligonucleotide species, where i) some or all of the target nucleic acids comprise an overhang, ii) some or all of the oligonucleotides in the first pool of oligonucleotide species comprise an overhang at a first end capable of hybridizing to a target nucleic acid overhang, where each oligonucleotide species has a unique overhang sequence and length, iii) each oligonucleotide in the first pool of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the oligonucleotide overhang, iv) each oligonucleotide in the first pool of oligonucleotide species comprises a first primer binding domain, and v) the nucleic acid composition and the first pool of oligonucleotide species are combined under conditions in which oligonucleotide overhangs hybridize to target nucleic acid overhangs having a corresponding length, thereby forming a first set of combined products; b) cleaving the first set of combined products, thereby forming cleaved products; and c) combining the cleaved products and a second pool of oligonucleotide species, where i) each oligonucleotide in the second pool of oligonucleotide species comprises a first strand and a second strand, where the first strand is shorter than the second strand, and where the first strand and the second strand are complementary at a first end of the oligonucleotide and the second strand comprises a single strand at a second end of the oligonucleotide, ii) each oligonucleotide in the second pool of oligonucleotide species comprises an oligonucleotide identification sequence specific to the second pool of oligonucleotide species, iii) each oligonucleotide in the second pool of oligonucleotide species comprises a second primer binding domain on the second strand, where the first primer binding domain and the second primer binding domain are different, and iv) the cleaved products and the second pool of oligonucleotide species are combined under conditions in which oligonucleotides in the second pool of oligonucleotide species attach to at least one end of the cleaved products, thereby forming a second set of combined products.

Also provided in some aspects are compositions comprising a) a first pool of oligonucleotide species, where i) some or all of the oligonucleotides in the first pool of oligonucleotide species comprise an overhang capable of hybridizing to a target nucleic acid overhang, where each oligonucleotide species has a unique overhang sequence and length, ii) each oligonucleotide in the first pool of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the oligonucleotide overhang, and iii) each oligonucleotide in the first pool of oligonucleotide species comprises a first primer binding domain; and b) a second pool of oligonucleotide species, where i) each oligonucleotide in the second pool of oligonucleotide species comprises a first strand and a second strand, where the first strand is shorter than the second strand, and where the first strand and the second strand are complementary at a first end of the oligonucleotide and the second strand comprises a single strand at a second end of the oligonucleotide, ii) each oligonucleotide in the second pool of oligonucleotide species comprises an oligonucleotide identification sequence specific to the second pool of oligonucleotide species, and iii) each oligonucleotide in the second pool of oligonucleotide species comprises a second primer binding domain on the second strand, where the first primer binding domain and the second primer binding domain are different.

Provided in some aspects are methods for producing a nucleic acid library, comprising a) contacting a nucleic acid composition comprising target nucleic acids with an agent comprising a phosphatase activity under conditions in which target nucleic acids are dephosphorylated, thereby generating dephosphorylated target nucleic acids, where some or all of the target nucleic acids comprise an overhang; and b) combining the dephosphorylated target nucleic acids and a plurality of oligonucleotide species, where i) some or all of the oligonucleotides in the plurality of oligonucleotide species comprise an overhang capable of hybridizing to a target nucleic acid overhang, where each oligonucleotide species has a unique overhang sequence and length; ii) each oligonucleotide in the plurality of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the oligonucleotide overhang; and iii) the nucleic acid composition and the plurality of oligonucleotide species are combined under conditions in which oligonucleotide overhangs hybridize to target nucleic acid overhangs having a corresponding length, thereby forming hybridization products.

Provided in some aspects are methods for analyzing nucleic acid comprising a) combining a nucleic acid composition comprising target nucleic acids and a plurality of oligonucleotide species, where i) some or all of the target nucleic acids comprise an overhang; ii) some or all of the oligonucleotides in the plurality of oligonucleotide species comprise an overhang capable of hybridizing to a target nucleic acid overhang, where each oligonucleotide species has a unique overhang sequence and length; iii) each oligonucleotide in the plurality of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the oligonucleotide overhang; and iv) the nucleic acid composition and the plurality of oligonucleotide species are combined under conditions in which oligonucleotide overhangs hybridize to target nucleic acid overhangs having a corresponding length, thereby forming hybridization products; b) sequencing the hybridization products, or amplification products thereof, by a sequencing process, thereby generating sequence reads, where the sequence reads comprise forward sequence reads and reverse sequence reads; and c) analyzing overhang information associated with overhang identification sequences that indicate presence of an overhang for the reverse sequence reads, thereby generating an analysis, and omitting from the analysis overhang information associated with overhang identification sequences that indicate presence of an overhang for the forward sequence reads.
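The read-level filtering in step c) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the method: the `Read` record, the `ID_TO_OVERHANG` lookup table and the identification sequences in it are hypothetical, and each read is assumed to already carry its direction and any identification sequence found in it.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional

# Hypothetical lookup: identification sequence -> (overhang type, overhang length).
# A length of 0 stands for an identification sequence that indicates no overhang.
ID_TO_OVERHANG = {
    "ACGT": ("5'", 2),
    "TTAG": ("3'", 1),
    "GGCC": ("blunt", 0),
}


class Read(NamedTuple):
    direction: str              # "forward" or "reverse" (assumed to be known)
    overhang_id: Optional[str]  # identification sequence found in the read, if any


def analyze_reverse_overhangs(reads: Iterable[Read]) -> Counter:
    """Tally overhang features for reverse reads only.

    Identification sequences that indicate an overhang are counted when they
    come from reverse reads and are omitted when they come from forward reads.
    """
    counts: Counter = Counter()
    for read in reads:
        feature = ID_TO_OVERHANG.get(read.overhang_id)
        if feature is None or feature[1] == 0:
            continue  # unknown identifier, or identifier indicating no overhang
        if read.direction == "reverse":
            counts[feature] += 1
    return counts
```

In this sketch, forward reads are dropped only at the counting step, so the same loop could be extended to report how many forward-read observations were omitted.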

Provided in some aspects are methods for assaying a population of nucleic acids, comprising assaying nucleic acid overhangs of a population of nucleic acids in a sample, thereby generating an overhang profile of the population; and based on the overhang profile, determining a characteristic of the sample.
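One way to picture an overhang profile is as the relative frequency of each observed overhang type and length. The Python sketch below assumes that representation; the `(type, length)` encoding and the function name are hypothetical, and the sketch does not address how a sample characteristic is determined from the profile, which this passage does not specify.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical encoding of one observed end: (overhang type, overhang length).
Overhang = Tuple[str, int]


def overhang_profile(observed: Iterable[Overhang]) -> Dict[Overhang, float]:
    """Return the fraction of observed ends that fall in each overhang category."""
    counts = Counter(observed)
    total = sum(counts.values())
    if total == 0:
        return {}  # no observations, so no profile
    return {overhang: n / total for overhang, n in counts.items()}
```

Normalizing to fractions, rather than keeping raw counts, makes profiles from samples with different sequencing depth directly comparable.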

Also provided are systems, machines and computer program products that, in some embodiments, carry out certain methods or parts of certain methods described herein.

Certain embodiments are described further in the following description, examples, claims and drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The drawings illustrate certain embodiments of the technology and are not limiting. For clarity and ease of illustration, the drawings are not made to scale and, in some instances, various aspects may be shown exaggerated or enlarged to facilitate an understanding of particular embodiments.

FIG. 1A to FIG. 1C show examples of hairpin adapter configurations. FIG. 1A shows the stem-loop structure of a sequencing adapter with Illumina priming sites (P5 and P7) incorporated, with a single 3′ overhanging thymine (T) for thymine-adenine (TA) ligation and a phosphorylated 5′ end. FIG. 1B shows an adapter that is not phosphorylated, but includes unique end identifiers (UEIs) indicating the type and length of overhang (OV) present. FIG. 1C shows an adapter that further includes unique molecular identifiers (UMIs). A phosphorothioate bond is present between the last two bases on both ends of the oligo/adapter to prevent chew back from nuclease activity. *, phosphorothioate bond. G, guanine (RNA base). T, thymine. UEI, unique end identifier. UMI, unique molecular identifier. OV, overhang. P, phosphate. P5, Illumina P5 adapter sequence. P7, Illumina P7 adapter sequence.

FIG. 2A shows examples of adapters with a short insert (top) or a long insert (bottom). An example of a short insert adapter with 5′ overhangs is shown on the top left and an example of a short insert adapter with 3′ overhangs is shown on the top right. OV, overhang. UEI, unique end identifier. A, adenine. U, uracil or deoxyuridine. FIG. 2B shows an example workflow using a short insert adapter with 3′ overhangs. A first step includes phosphorylating the DNA template. As illustrated here, an example template has 3′ overhangs on top and bottom strands. A next step includes ligating the 3′ overhanging UEI adapters to the phosphorylated template. Nicks are present in the ligation product as shown. A further step includes enzymatically cutting UEI adapter DNA strand at deoxyuridine. A further step includes filling-in at nicks using a strand-displacing polymerase to form a complete double-stranded molecule. Unligated or residual UEI adapters remaining intact or filled-in can be eliminated. A, adenine. U, uracil or deoxyuridine. P, phosphate.

FIG. 3 shows mapped library molecule lengths from original DNA extract (not size-selected) and size selected via dual SPRI selection into high molecular weight (HMW) and low molecular weight (LMW) fragments. Panel A shows results for indiv1 all DNA, purification style 1. Panel B shows results for indiv1 all DNA, purification style 2. Panel C shows results for indiv1 HMW fraction. Panel D shows results for indiv1 LMW fraction. Panel E shows results for indiv2 LMW fraction. Panel F shows results for indiv2 HMW fraction.

FIG. 4 shows paired-end and mapped reads with unique end identifiers (UEIs, “barcodes”) present on no end (0), one end (1), or correctly on both ends (2). Panel A shows results for indiv1 all DNA, purification style 1. Panel B shows results for indiv1 all DNA, purification style 2. Panel C shows results for indiv1 HMW fraction. Panel D shows results for indiv1 LMW fraction. Panel E shows results for indiv2 LMW fraction. Panel F shows results for indiv2 HMW fraction.
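The 0/1/2 tally described for FIG. 4 can be sketched in Python as follows. This is only an illustration: the function name and the input representation (one pair per mapped read pair, holding the UEI detected on each end, or `None` when no UEI was detected) are assumptions, and the sketch does not cover how UEIs are detected or validated.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical representation of one mapped read pair: the UEI found on each
# end, or None where no UEI was detected.
ReadPair = Tuple[Optional[str], Optional[str]]


def count_uei_ends(pairs: Iterable[ReadPair]) -> Dict[int, int]:
    """Count read pairs carrying a UEI on no end (0), one end (1) or both ends (2)."""
    tally = Counter({0: 0, 1: 0, 2: 0})  # keep all three categories, even if empty
    for uei_first, uei_second in pairs:
        ends_with_uei = (uei_first is not None) + (uei_second is not None)
        tally[ends_with_uei] += 1
    return dict(tally)
```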

FIG. 5 shows examples of unique end identifier (UEI) adapters with a blocker at various locations.

FIG. 6 shows an example workflow that includes ligating blocked unique end identifier (UEI) adapters to a phosphorylated template, filling in at nicks, and creating a blunt-ended and double-stranded molecule. P, phosphate. UEI, unique end identifier. Iso-dC, isodeoxy-cytosine.

FIG. 7A to FIG. 7C show tapestation results (i.e., Agilent Tapestation 4200) depicting size distribution of cell-free DNA fragments in sequencing libraries under three conditions: no phosphatase treatment (FIG. 7A), phosphatase treatment of cell-free DNA template (FIG. 7B), and phosphatase treatment of template and library adapters (FIG. 7C). Adapter-dimer artifacts are expected at ~120 bp. One nucleosome is expected to peak at 280-290 bp with increments of ~170 bp for additional nucleosomes. FIG. 7A to FIG. 7C demonstrate improvement following phosphatase treatment illustrated by a reduction in adapter dimers and relative increase in cell-free DNA-associated peaks.
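The expected peak positions follow from simple arithmetic on the figures stated above: a one-nucleosome peak at 280-290 bp plus approximately 170 bp per additional nucleosome. A short Python sketch, in which the function name and parameter names are hypothetical and the increment is treated as exactly 170 bp for illustration:

```python
from typing import List, Tuple


def expected_nucleosome_peaks(
    n_peaks: int,
    mono_range: Tuple[int, int] = (280, 290),  # one-nucleosome peak, in bp
    increment: int = 170,                      # approximate step per extra nucleosome, in bp
) -> List[Tuple[int, int]]:
    """Return approximate (low, high) library peak ranges for 1..n_peaks nucleosomes."""
    low, high = mono_range
    return [(low + k * increment, high + k * increment) for k in range(n_peaks)]


print(expected_nucleosome_peaks(3))  # [(280, 290), (450, 460), (620, 630)]
```

Under these assumptions, the two- and three-nucleosome peaks would fall near 450-460 bp and 620-630 bp, well separated from the ~120 bp adapter-dimer artifact.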

FIG. 8A shows an example adapter set, with one 5′ palindromic overhang and with 5′ and 3′ random overhangs of varying lengths. FIG. 8B shows an overhanging adapter set ligated to a long dsDNA fragment (high molecular weight (HMW) DNA template) with native 5′ and 3′ overhangs. The illustration in FIG. 8B depicts, from top to bottom, examples after step 3 (ligation), step 5 (shearing), and step 6 (isolate biotinylated fragments) of a “mate pair” DNA preparation. UEI, unique end identifier. OV, overhang.

FIG. 9A shows an example method for attaching a unique end identifier (UEI) sequence in a first phase using strand-displacing polymerase, and sequencer-specific sequence (e.g., sequencing adapter) in a second phase. Panel A shows a Y adapter (left) and a hairpin adapter (right) composed of a unique end identifier (UEI) sequence (shown in gray) and random sequence (shown in black). In some instances, the Y adapter is a cleaved version of the hairpin adapter. Panel B shows ligation of the hairpin adapter to a target nucleic acid, which ligation product can be cleaved. After cleavage, the ligation product is the same as the Y-adapter ligation product. Panel C shows a fill-in step at nicks with a strand-displacing polymerase to create a fully complementary double-stranded, blunt-ended fragment. Panel D shows a nucleic acid fragment that is ready for any sequencing library preparation of choice (second phase). X, cleavable site(s). UEI, unique end identifier. OV, overhang. P, phosphate.

FIG. 9B shows an example method for attaching a Y adapter or a hairpin adapter to the ends of a native nucleic acid fragment. Panel A shows a Y adapter (left) and a hairpin adapter (right) composed of an overhang, a unique end identifier (UEI) sequence (shown in gray), and priming sequences (priming sequence 1 (e.g., Illumina P5 priming sequence) and priming sequence 2 (e.g., Illumina P7 priming sequence); priming region shown in black). Panel B shows ligation of the adapters to a target nucleic acid. Because the adapters are not phosphorylated, the ligation only occurs at the 5′ end of the template, leaving nicks. Panel C shows that the nicks are repaired once the 5′ adapter strand is phosphorylated and ligates the 3′ end of the adapter. After nick repair, the hairpin adapter ligation product can be cleaved (top) at the cleavage site. After cleavage, the ligation product is the same as the Y-adapter ligation product (bottom). This method generates a double-stranded nucleic acid fragment that is ready for any sequencing library preparation of choice (second phase) and/or sequencer of choice, which may depend on the priming sequences used. OV, overhang. P, phosphate. P1, priming sequence 1. P2, priming sequence 2.

FIGS. 10A-10C show example methods for attaching a unique end identifier (UEI) sequence to native DNA template using an oligonucleotide adapter having RNA bases in the overhang (“RNA overhangs”). FIG. 10A shows example configurations of oligonucleotide adapters having RNA overhangs. Black regions represent non-complementary bases or blocking bases, with or without sequencer specific adapter sequences (e.g., P5, P7). FIGS. 10B and 10C show example methods where RNA overhang ends are ligated to phosphorylated DNA templates, creating DNA-RNA duplexes. Nicks may be repaired by ligase or strand displacing fill-in, depending on oligonucleotide adapter configuration. Adapter dimers having double stranded RNA (dsRNA) can be digested. X, cleavable site(s). UEI, unique end identifier. OV, overhang. P, phosphate.

FIG. 11 shows a method for attaching oligonucleotide adapters to high molecular weight (HMW) DNA. FIG. 11 discloses SEQ ID NOS 53-54, 54, 53, 53, 55-56, 55-56 and 55, respectively, in order of appearance.

FIG. 12 shows a method for attaching oligonucleotide adapters to high molecular weight (HMW) DNA. FIG. 12 discloses SEQ ID NOS 53, 55, 53 and 55, respectively, in order of appearance.

FIG. 13 shows oligonucleotide adapter designs. FIG. 13 discloses SEQ ID NOS 53, 55, 53, 55-56 and 55, respectively, in order of appearance.

FIG. 14 shows a method for attaching oligonucleotide adapters to high molecular weight (HMW) DNA.

FIG. 15 shows results of sensitivity experiments. Overhang sequences are only considered if they occur on reverse reads. Panel A: overhang counts, divided by total per library, across two replicate libraries for 100% mechanically sheared DNA. Values are means across two libraries; error bars show maximum and minimum value. Panel B: overhang counts, divided by total per library, across two replicate libraries for 100% MluCI digested DNA. Values are means across two libraries; error bars show maximum and minimum value. Panel C: MluCI target sequence abundance with increasing concentration of MluCI. As the percent of MluCI digested DNA increases (x-axis), so does the frequency of its target sequence (AATT) among 5′ overhang sequences (y-axis). Where replicate libraries were available, error bars show minimum and maximum values. Panel D: MluCI target sequence is identifiable even in 1% MluCI digested DNA. Points are counts of individual overhang sequences, divided by the sum of all such counts per library. Mean counts across two replicate libraries of only mechanically sheared DNA (x-axis) are shown against mean counts across two replicate libraries of 1% MluCI digested DNA (y-axis). The percent error of each count in 1% MluCI digested DNA was computed, using the count in mechanically sheared DNA as the expected value. All sequences for which this value, rounded to the thousandths place, fell at or above the 99.9th percentile of the distribution are shown. The target sequence (AATT) has the highest percent error (6.2%; 99.9th percentile; p<0.001).

FIG. 16 shows overhang profile and base composition of overhangs created by a Micrococcal nuclease (restriction endonuclease MluCI). Results are the average of two independent libraries; error bars on the overhang abundance plot show maximum and minimum value. Input DNA for libraries is human genomic DNA extracted from GM12878 cells.

FIG. 17 shows effect of blood collection tubes on human cfDNA length and overhang profiles of control oligos. Differences between expected and observed control oligo overhang lengths demonstrate loss in overhang length in RTT by 4 hours and by 24 hours in YTT. Shown are frequencies of difference from expected length between −1 (chewed back by one base) and −5 (the 99th percentile of the distribution). PBS, Phosphate Buffered Saline pH 7.4 (control). RTT, red top tubes (serum). PTT, purple top tubes (potassium EDTA). YTT, yellow top tube (citrate). Control, control oligos without spiking or extraction.

FIG. 18 shows accuracy in overhang determination. UEI data from reverse read only vs. UEI from forward read only. Accuracy in overhang determination is highest when only not-blunt UEIs on reverse reads are considered. X-axis: percent of UEIs ligated to the correct end of the correct control oligo, excluding non-blunt UEIs on reverse reads. Y-axis: same value, but excluding non-blunt UEIs on forward reads.

FIG. 19 shows a schematic of proposed gaps and flaps. The library preparation protocol completes ligation in two separate reactions. Black circles represent the 1st ligation, where the phosphates are present on the 5′ end of the template. The adapters lack phosphates so a 2nd ligation event (white circles) is required to add phosphates to the 5′ end of the adapter, permitting a fully formed double-stranded library molecule. P5 adapters are at forward reads, P7 adapters are at reverse reads. The following was observed: 1) an excess of only one of the two original strands, and 2) P5 UEIs are more inaccurate than P7 UEIs. Together these observations revealed the presence of several failure modes that may be caused by gaps and flaps of the overhangs during ligation. Given a template with one blunt end and one overhanging end, as depicted, several failure modes during adapter ligation can cause one of the two strands to be lost. The top panel shows an error mode where a mismatch in the length of a 5′ overhang causes a gap. The bottom panel shows an error mode where a mismatch in the length of a 3′ overhang causes a flap. In both cases, these errors force an ‘incorrect’ covalent bond during the 1st ligation (black), inhibiting the 2nd ligation (white). This leads to conversion of only one strand and the loss of the other strand. Furthermore, in these cases the P5 UEI will report the wrong overhang length but the P7 UEI will be correct. A much higher accuracy of the P7 UEI was observed when they are blunt or overhangs; for this reason P7 UEIs were used during certain analyses. Although unlikely, if a gap at a 3′ overhang or a flap at a 5′ overhang does occur, neither strand would convert into the library.

FIG. 20 shows a heat map generated from sequencing data of DNA overhangs present in each library produced using overhang adapters described herein. The heat map was generated using Ward's hierarchical clustering method. Each column represents a single cell-free DNA library from a cancer donor (black bar) or healthy donor (no bar). Each row represents a unique overhang (5′ or 3′) of length 1 to 6 nucleotides; rows (overhangs) containing at least one CG dinucleotide, or CpG, are indicated by a grey bar. Within the heat map matrix, darker colors indicate that the overhang represents a larger proportion (log scaled) of the library. Lighter colors indicate depletion of that overhang. Scale on bottom of figure; N=50 no cancer reported; N=21 cancer.

FIG. 21 shows variables used in certain models.

FIG. 22 shows a logistic regression classifier for cancer versus healthy samples.

FIG. 23 shows a classification report and receiver operating characteristic (ROC) for cancer versus healthy samples.

FIG. 24 shows a model summary for gastrointestinal (GI) cancer versus healthy samples.

FIG. 25 shows a model summary for gastrointestinal (GI) cancer versus other samples (includes healthy and other cancer).

DETAILED DESCRIPTION

Provided herein are methods and compositions useful for analyzing nucleic acid. Also provided herein are methods and compositions useful for producing nucleic acid libraries. Also provided herein are methods and compositions useful for analyzing ends of nucleic acid fragments. In certain aspects, the methods include combining sample nucleic acid and oligonucleotides. In some embodiments, one or more oligonucleotides include an overhang capable of hybridizing to an overhang in a sample nucleic acid. In some embodiments, one or more oligonucleotides include a blunt end capable of ligating to a blunt end in a sample nucleic acid. In some embodiments, oligonucleotides each include at least one oligonucleotide overhang identification sequence. Oligonucleotides may comprise overhangs of different lengths and different sequences, and overhang identification sequences may be specific to the length of corresponding overhangs (and may be specific to other features of an overhang). In some embodiments, oligonucleotides include a cleavage site. In some embodiments, oligonucleotides are capable of forming a hairpin structure. In some embodiments, oligonucleotides comprise two strands, with an overhang at a first end and two non-complementary strands at a second end. In some embodiments, sample nucleic acid and oligonucleotides are combined under conditions in which overhangs in the oligonucleotides hybridize to overhangs in the sample nucleic acid having a corresponding length and complementary sequence, thereby forming hybridization products. In some embodiments, hybridization products include circularized nucleic acid fragments. In some embodiments, methods include generating blunt-ended nucleic acid fragments. Such hybridization products and/or blunt-ended nucleic acid fragments may be useful for producing a nucleic acid library and/or further analysis or processing, for example.

Nucleic Acid Ends

Provided herein are methods and compositions for analyzing nucleic acids. Methods may comprise modifying and/or analyzing nucleic acid ends. A nucleic acid end refers to the terminus of a nucleic acid fragment. Generally, a linear nucleic acid fragment contains two termini (i.e., a beginning and an end). Such termini are often referred to as a 5′ end and a 3′ end. A non-linear fragment may contain more than two termini (e.g., a forked fragment may contain 3 or more termini). For a double-stranded fragment, a nucleic acid end may contain an overhang or may be blunt ended (i.e., contains no overhang). The term overhang or overhang region generally refers to a single stranded portion at a nucleic acid end. For example, a nucleic acid fragment may include a double stranded or “duplex” region comprising one or more paired nucleotides (bases) and a single stranded or “overhang” region comprising one or more unpaired nucleotides (bases). Typically, an overhang refers to a single stranded region at an end of a nucleic acid molecule and not to a single stranded region flanked by double stranded regions. An overhang may be a 5′ overhang or a 3′ overhang. A 5′ overhang generally refers to a single stranded region at the end of a nucleic acid molecule that reads according to conventional nucleic acid directionality in a 3′ to 5′ direction starting at the junction where the duplex portion ends and the single stranded portion begins and ending at the terminus (free end) of the overhang. A 3′ overhang generally refers to a single stranded region at the end of a nucleic acid molecule that reads according to conventional nucleic acid directionality in a 5′ to 3′ direction starting at the junction where the duplex portion ends and the single stranded portion begins and ending at the terminus (free end) of the overhang.
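By way of non-limiting illustration, the definitions of 5′ overhangs, 3′ overhangs, and blunt ends may be sketched in Python for a linear two-stranded fragment. The coordinate convention, the function name classify_ends, and the parameter bottom_start are assumptions made for this illustration only and are not part of the methods described herein. The sketch assumes the two strands overlap to form a duplex.

```python
from typing import NamedTuple, Tuple


class End(NamedTuple):
    kind: str    # "5' overhang", "3' overhang", or "blunt"
    length: int  # number of unpaired bases at this end


def classify_ends(top_len: int, bottom_len: int, bottom_start: int) -> Tuple[End, End]:
    """Classify the left and right ends of a linear duplex.

    Illustrative convention (an assumption): the top strand is drawn 5'->3'
    from coordinate 0; the bottom strand is drawn 3'->5' and its 3' terminus
    sits at coordinate bottom_start.
    """
    def end(protrusion: int, top_kind: str, bottom_kind: str) -> End:
        if protrusion > 0:   # the top strand extends past the bottom strand
            return End(top_kind, protrusion)
        if protrusion < 0:   # the bottom strand extends past the top strand
            return End(bottom_kind, -protrusion)
        return End("blunt", 0)

    # Left end: the top strand's 5' terminus versus the bottom strand's 3' terminus.
    left = end(bottom_start, "5' overhang", "3' overhang")
    # Right end: the top strand's 3' terminus versus the bottom strand's 5' terminus.
    right = end(top_len - (bottom_start + bottom_len), "3' overhang", "5' overhang")
    return left, right
```

For example, two 10-base strands staggered by 2 bases give a 2-base 5′ overhang at each end, which is the geometry left by many restriction enzymes.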

Target nucleic acids may comprise an overhang (e.g., at one end of a nucleic acid fragment) and may comprise two overhangs (e.g., at both ends of a nucleic acid fragment). Target nucleic acids may comprise two overhangs, one overhang and one blunt end, two blunt ends, or a combination of these. Target nucleic acids may comprise two 3′ overhangs, two 5′ overhangs, one 3′ overhang and one 5′ overhang, one 3′ overhang and one blunt end, one 5′ overhang and one blunt end, two blunt ends, or a combination of these. In some embodiments, overhangs in target nucleic acids are native overhangs. In some embodiments, target nucleic acid ends are native blunt ends. Native overhangs and native blunt ends generally refer to overhangs and blunt ends that have not been modified (e.g., have not been filled in, have not been cleaved or digested (e.g., by an endonuclease or exonuclease), have not been added or added to) prior to combining a sample composition with oligonucleotides described herein. Often, native overhangs and native blunt ends generally refer to overhangs and blunt ends that have not been modified ex vivo (e.g., have not been filled in ex vivo, have not been cleaved or digested ex vivo (e.g., by an endonuclease or exonuclease), have not been added or added to ex vivo) prior to combining a sample composition with oligonucleotides described herein. In certain instances, native overhangs and native blunt ends generally refer to overhangs and blunt ends that have not been modified after collection from a subject or source (e.g., have not been filled in after collection from a subject or source, have not been cleaved or digested after collection from a subject or source (e.g., by an endonuclease or exonuclease), have not been added or added to after collection from a subject or source). Native overhangs and native blunt ends generally do not include overhangs/ends created by contacting an isolated sample with a cleavage agent (e.g., endonuclease, exonuclease, restriction enzyme), and/or a polymerase.
Native overhangs and native blunt ends generally do not include overhangs/ends created by mechanical shearing (e.g., ultrasonication (e.g., Adaptive Focused Acoustics™ (AFA) process by Covaris)). Native overhangs and native blunt ends generally do not include overhangs/ends created by contacting an isolated sample with an exonuclease (e.g., DNAse). Native overhangs and native blunt ends generally do not include overhangs/ends created by amplification (e.g., polymerase chain reaction). Native overhangs and native blunt ends generally do not include overhangs/ends attached to a solid support, conjugated to another molecule, or cloned into a vector. In some embodiments, native overhangs and native blunt ends may be subjected to dephosphorylation and may be referred to as dephosphorylated native overhangs and dephosphorylated native blunt ends. In some embodiments, native overhangs and native blunt ends may be subjected to phosphorylation and may be referred to as phosphorylated native overhangs and phosphorylated native blunt ends.

Oligonucleotides

In some embodiments, nucleic acids (e.g., nucleic acids from a sample; target nucleic acids) are combined with oligonucleotides. An oligonucleotide generally refers to a nucleic acid (e.g., DNA, RNA) polymer that is distinct from the target nucleic acids, and may be referred to as oligos, adapters, oligonucleotide adapters, and oligo adapters. Oligonucleotides may be short in length (e.g., less than 50 bp, less than 40 bp, less than 30 bp, less than 20 bp, less than 10 bp, less than 5 bp) and sometimes, but not always, are shorter than target nucleic acids. Oligonucleotides may be artificially synthesized. In some embodiments, nucleic acids (e.g., nucleic acids from a sample; target nucleic acids) are combined with a plurality or pool of oligonucleotide species. A pool of oligonucleotide species may be referred to as a set of oligonucleotide species, and may comprise a plurality of different oligonucleotide species. Methods and compositions herein may include more than one pool of oligonucleotide species (e.g., a first pool of oligonucleotide species and a second pool of oligonucleotide species). In such instances, oligonucleotides in a first pool may share a common feature and oligonucleotides in a second pool may share a different common feature. A common feature in a pool may include a particular domain and/or a particular modification. In some embodiments, a common feature in a pool includes a common primer binding domain.

A species of oligonucleotide generally contains a feature that is unique with respect to other oligonucleotide species. For example, an oligonucleotide species may contain a unique overhang feature. A unique overhang feature may include a unique overhang length, a unique overhang sequence, or a combination of a unique overhang sequence and overhang length. For example, an oligonucleotide species may contain a unique sequence for a particular overhang length with respect to other oligonucleotide species having the given overhang length. In some instances, an oligonucleotide species contains a unique sequence for a particular overhang length and type (e.g., 5′ or 3′) with respect to other oligonucleotide species having the given overhang length and type.
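By way of non-limiting illustration, a pool in which each species is unique in its combination of overhang type, length, and sequence may be enumerated as sketched below in Python. The function name enumerate_overhang_species, the tuple layout, and the restriction to the four standard DNA bases are assumptions made for this illustration; an actual pool may use random, universal, or modified bases instead.

```python
from itertools import product


def enumerate_overhang_species(max_len, include_blunt=True):
    """List hypothetical oligonucleotide species as (type, length, sequence).

    Each species is unique in its combination of overhang type (5' or 3'),
    overhang length (1..max_len), and overhang sequence over A, C, G, T.
    """
    species = []
    if include_blunt:
        species.append(("blunt", 0, ""))
    for kind in ("5'", "3'"):
        for length in range(1, max_len + 1):
            for bases in product("ACGT", repeat=length):
                species.append((kind, length, "".join(bases)))
    return species


pool = enumerate_overhang_species(max_len=2)
# Two types x (4 + 16) sequences, plus one blunt species.
assert len(pool) == 41
```

Because the number of sequences grows as 4 raised to the overhang length, fully enumerated pools become large quickly, which is one practical motivation for random or universal bases in longer overhangs.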

Oligonucleotides may comprise an overhang (e.g., at one end of the oligonucleotide) and may comprise two overhangs (e.g., at both ends of the oligonucleotide). In some embodiments, oligonucleotides comprise two overhangs, one overhang and one blunt end, two blunt ends, or a combination of these. In some embodiments, oligonucleotides comprise two 3′ overhangs, two 5′ overhangs, one 3′ overhang and one 5′ overhang, one 3′ overhang and one blunt end, one 5′ overhang and one blunt end, two blunt ends, or a combination of these. In some embodiments, oligonucleotides comprise two strands, with an overhang or blunt end at a first end and two non-complementary strands at a second end. For hairpin structure oligonucleotides described herein, such oligonucleotides (e.g., in the uncleaved state) generally comprise one overhang (e.g., a 5′ overhang or a 3′ overhang), and in certain instances, no overhang (i.e., a blunt end). Generally, an oligonucleotide overhang is capable of hybridizing to a target nucleic acid overhang. An oligonucleotide overhang may comprise a region that is complementary to a region in a target nucleic acid overhang. In some embodiments, the entire length of an oligonucleotide overhang is capable of hybridizing to the entire length of a target nucleic acid overhang. Thus, the entire oligonucleotide overhang may be complementary to the entire nucleic acid overhang.

Often, “complementary” or “complementarity” refers to sequence complementarity, as described herein, and “non-complementary” or “non-complementarity” refers to sequence non-complementarity, as described herein. In certain aspects, “complementary” or “complementarity” may refer to structural complementarity (e.g., overhang complementarity). For example, a target nucleic acid having a 5′, 8 base-pair overhang may have structural complementarity with an oligonucleotide having a 5′, 8 base-pair overhang. Structural complementarity may include non-specific base pairing. In certain embodiments, an oligonucleotide overhang comprises one or more nucleotides capable of non-specific base pairing to bases in the target nucleic acids. For example, a target nucleic acid having a 5′, 8 base-pair overhang may have structural complementarity with an oligonucleotide having a 5′, 8 base-pair overhang, where the oligonucleotide overhang comprises one or more nucleotides that can pair non-specifically with all or some of the base possibilities at a corresponding position in the target nucleic acid overhang. In certain embodiments, an oligonucleotide overhang comprises nucleotides that are all capable of non-specific base pairing to bases in the target nucleic acids. Nucleotides capable of non-specific base pairing may be referred to as “universal bases” which can replace any of the four typical bases described above (e.g., nitroindole, 5-nitroindole, 3-nitropyrrole, inosine, deoxyinosine, 2-deoxyinosine) or “degenerate/wobble bases” which can replace two or three (but not all) of the four typical bases (e.g., non-natural base P and K). In certain embodiments, an oligonucleotide overhang comprises one or more universal bases. In certain embodiments, an oligonucleotide overhang consists of universal bases.
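By way of non-limiting illustration, the distinction between sequence complementarity and structural complementarity with universal bases may be sketched in Python. The symbol "N" standing in for a universal base, the name can_hybridize, and the exact-match pairing rule are assumptions made for this illustration; degenerate/wobble bases and hybridization thermodynamics are not modeled.

```python
# Watson-Crick partners for the four standard DNA bases.
WATSON_CRICK = {"A": "T", "T": "A", "C": "G", "G": "C"}
# Placeholder symbol (an assumption) for a universal base such as inosine.
UNIVERSAL = "N"


def can_hybridize(oligo_overhang, target_overhang):
    """Return True if two overhangs of the same type can pair along their full length.

    Both overhangs are written 5'->3'. Overhangs of the same type and length
    anneal antiparallel, so oligo position i faces target position n-1-i.
    """
    if len(oligo_overhang) != len(target_overhang):
        return False  # lengths differ: no structural complementarity
    for oligo_base, target_base in zip(oligo_overhang, reversed(target_overhang)):
        if oligo_base == UNIVERSAL:
            continue  # a universal base pairs non-specifically
        if WATSON_CRICK.get(oligo_base) != target_base:
            return False
    return True
```

Under these assumptions, an overhang consisting only of universal bases pairs with every target overhang of the same length, so only the overhang length and type are interrogated.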

In some embodiments, each oligonucleotide in a plurality or pool of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the oligonucleotide overhang. An oligonucleotide overhang identification sequence may be referred to as an overhang identification sequence, an identification sequence, an oligonucleotide overhang identification polynucleotide, an overhang identification polynucleotide, an identification polynucleotide, a barcode, a variable overhang barcode, a unique end identifier (UEI), an end identifier, or an identifier. An overhang identification sequence uniquely identifies the overhang present in its respective oligonucleotide, and can uniquely identify each type of overhang (e.g., length, 5′ or 3′, and/or the like) present in target nucleic acids to which the oligonucleotide overhangs specifically hybridize. In certain embodiments, an overhang identification sequence can uniquely identify each type of native overhang (e.g., length, 5′ or 3′, and/or the like) present in target nucleic acids to which the oligonucleotide overhangs specifically hybridize. Often, overhang identification sequences specific to oligonucleotide overhangs that hybridize to overhangs of different lengths are different from one another and are unique. Typically, overhang identification sequences specific to i) oligonucleotide overhangs that hybridize to overhangs of different lengths; and ii) oligonucleotide overhangs of different type (i.e., 3′, 5′), are different from one another and are unique. Generally, no two overhang identification sequences specific to the length of an oligonucleotide overhang are in the plurality or pool of oligonucleotide species that have overhangs of a different length. In other words, a given overhang identification sequence (or set of sequences) that is specific to a given length of an oligonucleotide overhang will only be present in oligonucleotides having overhangs of such given length.
Oligonucleotides having a different overhang length will include a different overhang identification sequence (or set of sequences). In some embodiments, there is one overhang identification sequence for all oligonucleotide species having an overhang of a specific length. In some embodiments, there are two overhang identification sequences for all oligonucleotide species having an overhang of a specific length such that one overhang identification sequence is specific to the given length for 5′ overhangs and the other overhang identification sequence is specific to the given length for 3′ overhangs. In some embodiments, there are one or two overhang identification sequence(s) for all oligonucleotide species having an overhang of a specific length, irrespective of the sequence of the overhang. In some embodiments, there is a subset of overhang identification sequences for oligonucleotide species having an overhang of a specific length, where different overhang identification sequences in the subset are specific to different overhang sequences in the oligonucleotides (e.g., in addition to being specific to the length and type (i.e., 5′ or 3′) of overhang). In some embodiments, an overhang identification sequence is specific to no overhang (i.e., a blunt ended oligonucleotide).
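By way of non-limiting illustration, the relationship between overhang identification sequences and overhang type and length may be sketched in Python as a lookup table. The six-base identifier sequences shown are hypothetical placeholders and are not sequences disclosed herein; the names build_uei_decoder and decode_overhang are likewise assumptions made for this illustration.

```python
# Hypothetical UEI assignments: (identifier sequence, overhang type, overhang length).
HYPOTHETICAL_ASSIGNMENTS = [
    ("AACCGG", "blunt", 0),
    ("ACGTTG", "5'", 1),
    ("CTAGCA", "5'", 2),
    ("GATTCA", "3'", 1),
    ("TGCAAC", "3'", 2),
]


def build_uei_decoder(assignments):
    """Map each identifier to exactly one (type, length) class.

    Raises ValueError if one identifier is assigned to two different classes,
    which would violate the uniqueness described above.
    """
    decoder = {}
    for uei, kind, length in assignments:
        previous = decoder.setdefault(uei, (kind, length))
        if previous != (kind, length):
            raise ValueError(f"UEI {uei} assigned to two overhang classes")
    return decoder


def decode_overhang(decoder, uei):
    """Return (type, length) for an identifier read from a sequence read, or None."""
    return decoder.get(uei)


decoder = build_uei_decoder(HYPOTHETICAL_ASSIGNMENTS)
assert decode_overhang(decoder, "CTAGCA") == ("5'", 2)
```

In this sketch the overhang class is recovered from the read sequence alone, with exact matching of the identifier; tolerance for sequencing errors within the identifier is not modeled.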

Generally, an overhang identification sequence is informative about the length and/or type of corresponding oligonucleotide overhang by way of the nucleotide sequence of the overhang identification sequence. The nucleotide sequence of the overhang identification sequence may be sequenced by a sequencing process and included in sequence reads for the oligonucleotide-target sequences. Thus, in certain embodiments, overhang identification sequences do not generate additional signals beyond reads of their nucleotide sequences. For example, overhang identification sequences may not require labeling (e.g., by fluorescent labels), conjugation (e.g., to solid supports, antibodies), or hybridization to a polynucleotide carrying a label or conjugated to a solid support, antibody, and the like, to generate a signal.

In some embodiments, oligonucleotides include one or more portions or domains other than the overhang and the overhang identification sequence. Such additional portions may be included, for example, to facilitate one or more downstream applications that utilize or further process the hybridization products or derivatives thereof, such as nucleic acid amplification, sequencing (e.g., high-throughput sequencing), or both. In certain embodiments, an additional portion includes one or more nucleic acid binding domains such as, for example, primer binding domains (also referred to as priming sequences), and/or a sequencing adapter or one or more components of a sequencing adapter (e.g., one or more components described herein). In some embodiments, an oligonucleotide comprises a unique molecular identifier (UMI). UMIs generally are used for estimating the number of unique starting molecules (e.g., starting molecules prior to amplification) and, in certain instances, evaluating the sensitivity of a ligation reaction.
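By way of non-limiting illustration, the use of UMIs to estimate the number of unique starting molecules may be sketched in Python. The read representation as (mapped position, UMI) pairs and the name count_unique_molecules are assumptions made for this illustration; the sketch uses exact UMI matching and does not correct for sequencing errors within UMIs.

```python
from collections import defaultdict


def count_unique_molecules(reads):
    """Estimate starting molecules per mapped position from (position, umi) pairs.

    Reads sharing both a position and a UMI are treated as amplification
    copies of one starting molecule and are counted once.
    """
    umis_by_position = defaultdict(set)
    for position, umi in reads:
        umis_by_position[position].add(umi)
    return {position: len(umis) for position, umis in umis_by_position.items()}


example_reads = [(100, "ACGT"), (100, "ACGT"), (100, "TTGA"), (250, "ACGT")]
assert count_unique_molecules(example_reads) == {100: 2, 250: 1}
```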

In some embodiments, oligonucleotides include one or more primer binding domains. A primer binding domain is a polynucleotide to which a primer (e.g., an amplification primer) can anneal. A primer binding domain typically comprises a nucleotide sequence that is complementary or substantially complementary to the nucleotide sequence of a primer (e.g., an amplification primer). In some embodiments, different pools of oligonucleotide species may comprise oligonucleotides having primer binding domains, where each pool has its own primer binding domain. For example, oligonucleotides in pool A may comprise primer binding domain A, and oligonucleotides in pool B may comprise primer binding domain B, where primer binding domain A and primer binding domain B are different. Primer binding domain A and primer binding domain B may be considered different based on their nucleotide sequences being different. Primer binding domain A and primer binding domain B may be considered different based on the characteristic that primer A anneals to primer binding domain A and does not anneal to primer binding domain B, and primer B anneals to primer binding domain B and does not anneal to primer binding domain A.

In some embodiments, oligonucleotides include one overhang that hybridizes to target nucleic acid overhangs or includes a blunt end, and another overhang containing a sequence that does not hybridize to target nucleic acid overhangs. Such sequence that does not hybridize to target nucleic acid overhangs may contain a sequence that is generally not found in the target nucleic acid. Such sequence that does not hybridize to target nucleic acid overhangs also may contain a sequence that can hybridize to itself. For example, a sequence may include a palindromic sequence. Oligonucleotides containing overhangs having a palindromic sequence may hybridize to each end of a target nucleic acid by way of overhang hybridization, for example, and then hybridize to each other by way of palindromic sequence hybridization, forming a circular hybridization product.
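By way of non-limiting illustration, a palindromic sequence in the nucleic acid sense is one that equals its own reverse complement, so that two copies can hybridize to each other. This may be checked as sketched below in Python; the function names are assumptions made for this illustration, and only the four standard DNA bases are handled.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(sequence))


def is_palindromic(sequence):
    """True if the sequence equals its reverse complement and can pair with a copy of itself."""
    return sequence == reverse_complement(sequence)


assert is_palindromic("AATT")
assert is_palindromic("GAATTC")
assert not is_palindromic("AAAA")
```

A sequence of odd length cannot satisfy this test, because its middle base would have to be its own complement.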

In some embodiments, an oligonucleotide overhang comprises any suitable type of nucleotide (e.g., DNA nucleotides, RNA nucleotides, modified nucleotides, natural nucleotides), examples of which are provided herein. In some embodiments, an oligonucleotide overhang comprises one or more DNA nucleotides. In some embodiments, an oligonucleotide overhang consists of DNA nucleotides. In some embodiments, an oligonucleotide overhang comprises one or more RNA nucleotides. In some embodiments, an oligonucleotide overhang consists of RNA nucleotides. Oligonucleotide overhangs comprising or consisting of RNA nucleotides, for example, may hybridize to target nucleic acid overhangs comprising or consisting of DNA nucleotides, thereby forming an RNA-DNA duplex. An RNA ligase (e.g., T4 RNA ligase 2, SplintR® Ligase) may be used in such instances for ligation. In certain embodiments, unligated oligo dimer products (e.g., containing RNA-RNA duplexes) may be removed by digesting RNA-RNA duplexes (e.g., using an RNAse such as, for example, RNAse III).

Y-Oligonucleotides

In some embodiments, oligonucleotides comprise two strands, with an overhang at a first end and two non-complementary strands at a second end. Such oligonucleotides may be referred to as Y-oligonucleotides, Y-adapters, Y-shaped oligonucleotides, Y-shaped adapters, and the like. In some embodiments, oligonucleotides (e.g., Y-adapters) comprise two strands, with either a blunt end or an overhang at a first end and two non-complementary strands at a second end. An oligonucleotide having a Y-shaped structure generally comprises a double-stranded duplex region, two single stranded “arms” at one end, and either a blunt end or an overhang at the other end.

Y-oligonucleotides may comprise a plurality of polynucleotides. In some embodiments, Y-oligonucleotides comprise a first polynucleotide and a second polynucleotide. In some embodiments, a first polynucleotide (of a first strand) is complementary to a second polynucleotide (of a second strand). In some embodiments, a portion of a first polynucleotide (of a first strand) is complementary to a portion of a second polynucleotide (of a second strand). In some embodiments, a first polynucleotide comprises a first region that is complementary to a first region in a second polynucleotide, and the first polynucleotide comprises a second region that is not complementary to a second region in the second polynucleotide. The complementary region often forms the duplex region of the Y-oligonucleotide and the non-complementary region often forms the arms, or parts thereof, of the Y-oligonucleotide. The first and second polynucleotides may comprise components of adapters described herein, such as, for example, amplification priming sites and/or specific sequencing adapters (e.g., P5, P7 adapters). In some embodiments, the first and second polynucleotides do not comprise certain components of adapters described herein, such as, for example, amplification priming sites and specific sequencing adapters (e.g., P5, P7 adapters).

In some embodiments, a Y-oligonucleotide comprises an overhang (e.g., 5′ overhang, 3′ overhang). The overhang of a Y-oligonucleotide typically is located adjacent to the double-stranded duplex portion and at the end opposite the non-complementary strands (or “arms”) portion. The overhang of a Y-oligonucleotide typically is complementary to an overhang in a target nucleic acid. Y-oligonucleotides may also comprise an overhang identification sequence. In some embodiments, a Y-oligonucleotide comprises a blunt end opposite to the non-complementary strands (or “arms”) portion. In some embodiments, a plurality or pool of Y-oligonucleotide species comprises a mixture of: 1) oligonucleotides comprising an overhang; and 2) oligonucleotides comprising a blunt end.

Hairpins

In some embodiments, an oligonucleotide comprises one strand capable of forming a hairpin structure having a single-stranded loop. In some embodiments, an oligonucleotide consists of one strand capable of forming a hairpin structure having a single-stranded loop. An oligonucleotide having a hairpin structure generally comprises a double-stranded “stem” region and a single-stranded “loop” region. In some embodiments, an oligonucleotide comprises one strand (i.e., one continuous strand) capable of adopting a hairpin structure. In some embodiments, an oligonucleotide consists essentially of one strand (i.e., one continuous strand) capable of adopting a hairpin structure. Consisting essentially of one strand means that the oligonucleotide does not include any additional strands of nucleic acid (e.g., hybridized to the oligonucleotide) that are not part of the continuous strand. Thus, “consisting essentially of” here refers to the number of strands in the oligonucleotides, and the oligonucleotides can include other features not essential to the number of strands (e.g., can include a detectable label, can include other regions). Oligonucleotides comprising or consisting essentially of one strand capable of forming a hairpin structure may be referred to herein as hairpins, hairpin oligonucleotides, or hairpin adapters.

Hairpin oligonucleotides may comprise a plurality of polynucleotides within the one strand. In some embodiments, hairpin adapters comprise a first polynucleotide and a second polynucleotide. In some embodiments, a first polynucleotide is complementary to a second polynucleotide. In some embodiments, a portion of a first polynucleotide is complementary to a portion of a second polynucleotide. In some embodiments, a first polynucleotide comprises a first region that is complementary to a first region in a second polynucleotide, and the first polynucleotide comprises a second region that is not complementary to a second region in the second polynucleotide. The complementary region often forms the stem of the hairpin adapter and the non-complementary region often forms the loop, or part thereof, of the hairpin adapter. The first and second polynucleotides may comprise components of adapters described herein, such as, for example, amplification priming sites and specific sequencing adapters (e.g., P5, P7 adapters). In some embodiments, the first and second polynucleotides do not comprise certain components of adapters described herein, such as, for example, amplification priming sites and specific sequencing adapters (e.g., P5, P7 adapters).

Hairpin oligonucleotides may comprise one or more cleavage sites capable of being cleaved under cleavage conditions. In some embodiments, a cleavage site is located between a first and second polynucleotide. Cleavage at a cleavage site often generates two separate strands from the hairpin oligonucleotide. In some embodiments, cleavage at a cleavage site generates a partially double-stranded oligonucleotide with two unpaired strands forming a “Y” structure. Cleavage sites may include any suitable cleavage site, such as cleavage sites described herein, for example. In some embodiments, cleavage sites comprise RNA nucleotides and may be cleaved, for example, using an RNAse. In some embodiments, cleavage sites comprise uracil and/or deoxyuridine and may be cleaved, for example, using DNA glycosylase, endonuclease, RNAse, and the like and combinations thereof. In some embodiments, cleavage sites do not comprise uracil and/or deoxyuridine. In some embodiments, a method herein comprises, after combining hairpin oligonucleotides with target nucleic acids, exposing one or more cleavage sites to cleavage conditions, thereby cleaving the oligonucleotides.

In some embodiments, a hairpin oligonucleotide comprises an overhang (e.g., 5′ overhang, 3′ overhang). The overhang of a hairpin oligonucleotide typically is located adjacent to the double-stranded stem portion and at the end opposite the loop portion. The overhang of a hairpin oligonucleotide typically is complementary to an overhang in a target nucleic acid. Hairpin oligonucleotides may also comprise an overhang identification sequence. In some embodiments, a hairpin oligonucleotide comprises in a 5′ to 3′ orientation: a first overhang identification sequence, a first polynucleotide, one or more cleavage sites, a second polynucleotide, a second overhang identification sequence complementary to the first overhang identification sequence, and an overhang. In some embodiments, a hairpin oligonucleotide comprises in a 5′ to 3′ orientation: an overhang, a first overhang identification sequence, a first polynucleotide, one or more cleavage sites, a second polynucleotide, and an overhang identification sequence complementary to the first overhang identification sequence. In some embodiments, a plurality or pool of hairpin oligonucleotide species comprises a mixture of: 1) oligonucleotides comprising in a 5′ to 3′ orientation: a first overhang identification sequence, a first polynucleotide, one or more cleavage sites, a second polynucleotide, a second overhang identification sequence complementary to the first overhang identification sequence, and an overhang; and 2) oligonucleotides comprising in a 5′ to 3′ orientation: an overhang, a first overhang identification sequence, a first polynucleotide, one or more cleavage sites, a second polynucleotide, and an overhang identification sequence complementary to the first overhang identification sequence.
In certain embodiments of the above, the first and second polynucleotides are ordered in a 5′ to 3′ orientation as follows: first portion of first polynucleotide, second portion of first polynucleotide, cleavage site, second portion of second polynucleotide, and first portion of second polynucleotide, where the first portions of each polynucleotide are complementary and the second portions of each polynucleotide are not complementary. In some embodiments, a plurality or pool of hairpin oligonucleotide species comprises a mixture of: 1) oligonucleotides comprising an overhang; and 2) oligonucleotides comprising a blunt end.
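As an illustration of the 5′ to 3′ orderings described above, the following minimal sketch concatenates the named components into a single hairpin strand. The function names, the `overhang_side` parameter, and the toy sequences are hypothetical and are not taken from this disclosure; the sketch assumes DNA letters only and derives the second overhang identification sequence as the reverse complement of the first.

```python
# Illustrative sketch only: helper names and sequences are assumptions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def assemble_hairpin(overhang_id: str, first_poly: str, cleavage_site: str,
                     second_poly: str, overhang: str,
                     overhang_side: str = "3prime") -> str:
    """Concatenate hairpin components in 5' to 3' order.

    The second overhang identification sequence is derived as the reverse
    complement of the first, so the two can pair within the stem.
    """
    second_id = reverse_complement(overhang_id)
    core = overhang_id + first_poly + cleavage_site + second_poly + second_id
    if overhang_side == "3prime":
        return core + overhang      # overhang is the last element (3' end)
    if overhang_side == "5prime":
        return overhang + core      # overhang is the first element (5' end)
    raise ValueError("overhang_side must be '3prime' or '5prime'")


# Toy example: "U" stands in for a deoxyuridine cleavage site in the loop.
three_prime = assemble_hairpin("ACCGT", "GATTC", "U", "GAATC", "TTG")
five_prime = assemble_hairpin("ACCGT", "GATTC", "U", "GAATC", "TTG",
                              overhang_side="5prime")
print(three_prime)
print(five_prime)
```

A pool containing both overhang orientations could be modeled by calling the helper once per orientation for each overhang sequence of interest.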

Modified Nucleotides

In some embodiments, an oligonucleotide species comprises one or more modified nucleotides. Modified nucleotides may be referred to as modified bases and may include, for example, nucleotides conjugated to a member of a binding pair, blocked nucleotides, non-natural nucleotides, nucleotide analogues, peptide nucleic acid (PNA) nucleotides, Morpholino nucleotides, locked nucleic acid (LNA) nucleotides, bridged nucleic acid (BNA) nucleotides, glycol nucleic acid (GNA) nucleotides, threose nucleic acid (TNA) nucleotides, and the like and combinations thereof. In some embodiments, an oligonucleotide species comprises one or more modified nucleotides within a duplex region, within an overhang region, at one end, or at both ends of the oligonucleotide. In some embodiments, an oligonucleotide species comprises one or more unpaired modified nucleotides. In some embodiments, an oligonucleotide species comprises one or more unpaired modified nucleotides at one end of the oligonucleotide. In some embodiments, an oligonucleotide species comprises one or more unpaired modified nucleotides at the end of the oligonucleotide opposite to the end that hybridizes to a target nucleic acid (e.g., an end comprising an oligonucleotide overhang). A modified nucleotide may be present at the end of the strand having a 3′ terminus or at the end of the strand having a 5′ terminus.

In some embodiments, an oligonucleotide species comprises one or more blocked nucleotides. For example, an oligonucleotide species may comprise one or more modified nucleotides that are capable of blocking hybridization to a nucleotide in a target nucleic acid. In some instances, the one or more modified nucleotides are capable of blocking ligation to a nucleotide in a target nucleic acid. In some embodiments, an oligonucleotide species comprises one or more modified nucleotides that are incapable of binding to a natural nucleotide. In some embodiments, one or more modified nucleotides comprise one or more of an isodeoxy-base, a dideoxy-base, an inverted dideoxy-base, a spacer, and an amino linker.

In some embodiments, one or more modified nucleotides comprise an isodeoxy-base. In some embodiments, one or more modified nucleotides comprise isodeoxy-guanine (iso-dG). In some embodiments, one or more modified nucleotides comprise isodeoxy-cytosine (iso-dC). Iso-dC and iso-dG are chemical variants of cytosine and guanine, respectively. Iso-dC can hydrogen bond with iso-dG but not with unmodified guanine (natural guanine). Iso-dG can base pair with iso-dC but not with unmodified cytosine (natural cytosine). An oligonucleotide containing iso-dC can be designed so that it hybridizes to a complementary oligo containing iso-dG but cannot hybridize to any naturally occurring nucleic acid sequence.
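The pairing rules stated above can be expressed as a small lookup table. The sketch below is a deliberately simplified model: it treats hybridization as requiring every position to form an allowed pair, which ignores thermodynamics and mismatch tolerance. The symbol names and function names are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: a strict all-positions-must-pair model.
# Allowed pairs follow the text: natural A/T and C/G pairs, plus
# iso-dC with iso-dG. iso-dC does not pair with natural G, and
# iso-dG does not pair with natural C.
ALLOWED_PAIRS = {
    ("A", "T"), ("T", "A"),
    ("C", "G"), ("G", "C"),
    ("iso-dC", "iso-dG"), ("iso-dG", "iso-dC"),
}


def can_pair(base_a: str, base_b: str) -> bool:
    """Return True if the two base symbols form an allowed pair."""
    return (base_a, base_b) in ALLOWED_PAIRS


def can_hybridize(strand: list[str], partner: list[str]) -> bool:
    """Check antiparallel pairing of two strands, both listed 5' to 3'."""
    if len(strand) != len(partner):
        return False
    return all(can_pair(a, b) for a, b in zip(strand, reversed(partner)))


oligo = ["A", "iso-dC", "G"]
designed_partner = ["C", "iso-dG", "T"]   # contains the matching iso-dG
natural_partner = ["C", "G", "T"]         # natural bases only

print(can_hybridize(oligo, designed_partner))
print(can_hybridize(oligo, natural_partner))
```

Under this model the designed partner pairs at every position, while the all-natural partner fails at the iso-dC position.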

In some embodiments, one or more modified nucleotides comprise a dideoxy-base. In some embodiments, one or more modified nucleotides comprise dideoxy-cytosine. In some embodiments, one or more modified nucleotides comprise an inverted dideoxy-base. In some embodiments, one or more modified nucleotides comprise inverted dideoxy-thymine. For example, an inverted dideoxy-thymine located at the 5′ end of a sequence can prevent unwanted 5′ ligations.

In some embodiments, one or more modified nucleotides comprise a spacer. In some embodiments, one or more modified nucleotides comprise a C3 spacer. A C3 spacer phosphoramidite can be incorporated internally or at the 5′-end of an oligonucleotide. Multiple C3 spacers can be added at either end of an oligonucleotide to introduce a long hydrophilic spacer arm (e.g., for the attachment of fluorophores or other pendent groups). Other spacers include, for example, photo-cleavable (PC) spacers, hexanediol, spacer 9, spacer 18, 1′,2′-dideoxyribose (dSpacer), and the like.

In some embodiments, a modified nucleotide comprises a member of a binding pair. Binding pairs may include, for example, antibody/antigen, antibody/antibody, antibody/antibody fragment, antibody/antibody receptor, antibody/protein A or protein G, hapten/anti-hapten, biotin/avidin, biotin/streptavidin, folic acid/folate binding protein, vitamin B12/intrinsic factor, chemical reactive group/complementary chemical reactive group, digoxigenin moiety/anti-digoxigenin antibody, fluorescein moiety/anti-fluorescein antibody, steroid/steroid-binding protein, operator/repressor, nuclease/nucleotide, lectin/polysaccharide, active compound/active compound receptor, hormone/hormone receptor, enzyme/substrate, oligonucleotide or polynucleotide/its corresponding complement, the like or combinations thereof. In some embodiments, a modified nucleotide comprises biotin.

In some embodiments, a modified nucleotide comprises a first member of a binding pair (e.g., biotin); and a second member of a binding pair (e.g., streptavidin) is conjugated to a solid support or substrate. A solid support or substrate can be any physically separable solid to which a member of a binding pair can be directly or indirectly attached including, but not limited to, surfaces provided by microarrays and wells, and particles such as beads (e.g., paramagnetic beads, magnetic beads, microbeads, nanobeads), microparticles, and nanoparticles. Solid supports also can include, for example, chips, columns, optical fibers, wipes, filters (e.g., flat surface filters), one or more capillaries, glass and modified or functionalized glass (e.g., controlled-pore glass (CPG)), quartz, mica, diazotized membranes (paper or nylon), polyformaldehyde, cellulose, cellulose acetate, paper, ceramics, metals, metalloids, semiconductive materials, quantum dots, coated beads or particles, other chromatographic materials, magnetic particles; plastics (including acrylics, polystyrene, copolymers of styrene or other materials, polybutylene, polyurethanes, TEFLON™, polyethylene, polypropylene, polyamide, polyester, polyvinylidene difluoride (PVDF), and the like), polysaccharides, nylon or nitrocellulose, resins, silica or silica-based materials including silicon, silica gel, and modified silicon, Sephadex®, Sepharose®, carbon, metals (e.g., steel, gold, silver, aluminum, silicon and copper), inorganic glasses, conducting polymers (including polymers such as polypyrrole and polyindole); micro or nanostructured surfaces such as nucleic acid tiling arrays, nanotube, nanowire, or nanoparticulate decorated surfaces; or porous surfaces or gels such as methacrylates, acrylamides, sugar polymers, cellulose, silicates, or other fibrous or stranded polymers.
In some embodiments, a solid support or substrate may be coated using passive or chemically-derivatized coatings with any number of materials, including polymers, such as dextrans, acrylamides, gelatins or agarose. Beads and/or particles may be free or in connection with one another (e.g., sintered). In some embodiments, a solid support can be a collection of particles. In some embodiments, the particles can comprise silica, and the silica may comprise silica dioxide. In some embodiments, the silica can be porous, and in certain embodiments the silica can be non-porous. In some embodiments, the particles further comprise an agent that confers a paramagnetic property to the particles. In certain embodiments, the agent comprises a metal, and in certain embodiments the agent is a metal oxide (e.g., iron or iron oxides, where the iron oxide contains a mixture of Fe2+ and Fe3+). A member of a binding pair may be linked to a solid support by covalent bonds or by non-covalent interactions and may be linked to a solid support directly or indirectly (e.g., via an intermediary agent such as a spacer molecule or biotin).

Phosphorylation and Dephosphorylation

In some embodiments, a method herein comprises contacting a target nucleic acid composition with an agent comprising a phosphatase activity under conditions in which target nucleic acids are dephosphorylated, thereby generating a dephosphorylated target nucleic acid composition. In some embodiments, a method herein comprises contacting oligonucleotides with an agent comprising a phosphatase activity under conditions in which the oligonucleotides are dephosphorylated, thereby generating a plurality or pool of dephosphorylated oligonucleotide species. Generally, target nucleic acids and/or oligonucleotides are dephosphorylated prior to a combining step (i.e., prior to hybridization). Target nucleic acids may be dephosphorylated and then subsequently phosphorylated prior to a combining step (i.e., prior to hybridization). Oligonucleotides may be dephosphorylated and then subsequently phosphorylated prior to a combining step (i.e., prior to hybridization). Oligonucleotides may be dephosphorylated and then not phosphorylated prior to a combining step (i.e., prior to hybridization). Reagents and kits for carrying out dephosphorylation of nucleic acids are known and available. For example, target nucleic acids and/or oligonucleotides can be treated with a phosphatase (i.e., an enzyme that uses water to cleave a phosphoric acid monoester into a phosphate ion and an alcohol).

In some embodiments, a method herein comprises contacting a target nucleic acid composition with an agent comprising a phosphoryl transfer activity under conditions in which a 5′ phosphate is added to a 5′ end of target nucleic acids. In some embodiments, a method herein comprises contacting dephosphorylated target nucleic acids with an agent comprising a phosphoryl transfer activity under conditions in which a 5′ phosphate is added to a 5′ end of target nucleic acids. In some embodiments, a method herein comprises contacting oligonucleotides with an agent comprising a phosphoryl transfer activity under conditions in which a 5′ phosphate is added to a 5′ end of oligonucleotide species. In some embodiments, a method herein comprises contacting dephosphorylated oligonucleotides with an agent comprising a phosphoryl transfer activity under conditions in which a 5′ phosphate is added to a 5′ end of oligonucleotide species. Generally, target nucleic acids and/or oligonucleotides are phosphorylated prior to a combining step (i.e., prior to hybridization). 5′ phosphorylation of nucleic acids can be conducted by a variety of techniques. For example, target nucleic acids and/or oligonucleotides can be treated with a polynucleotide kinase (PNK) (e.g., T4 PNK), which catalyzes the transfer and exchange of Pi from the γ position of ATP to the 5′-hydroxyl terminus of polynucleotides (double- and single-stranded DNA and RNA) and nucleoside 3′-monophosphates. Suitable reaction conditions include, e.g., incubation of the nucleic acids with PNK in 1× PNK reaction buffer (e.g., 70 mM Tris-HCl, 10 mM MgCl₂, 5 mM DTT, pH 7.6 @ 25° C.) for 30 minutes at 37° C.; and incubation of the nucleic acids with PNK in T4 DNA ligase buffer (e.g., 50 mM Tris-HCl, 10 mM MgCl₂, 1 mM ATP, 10 mM DTT, pH 7.5 @ 25° C.) for 30 minutes at 37° C. Optionally, following the phosphorylation reaction, the PNK may be heat inactivated, e.g., at 65° C. for 20 minutes.
In some embodiments, methods do not include producing the 5′ phosphorylated nucleic acids by phosphorylating the 5′ ends of nucleic acids from a nucleic acid sample. In certain instances, a nucleic acid sample comprises nucleic acids with natively phosphorylated 5′ ends. In some embodiments, methods do not include producing the 5′ phosphorylated oligonucleotides by phosphorylating the 5′ ends of oligonucleotides.

Hybridization and Ligation

Nucleic acid fragments may be combined with oligonucleotides, thereby generating combined products. Combining nucleic acid fragments with oligonucleotides may comprise one or more of overhang hybridization, ligation (e.g., ligation of hybridization products), and blunt-end ligation. A combined product may include a nucleic acid fragment connected to (e.g., hybridized to and/or ligated to) an oligonucleotide at one or both ends of the nucleic acid fragment. In some embodiments, target nucleic acids may be combined with oligonucleotides, thereby generating combined products. In some embodiments, products from a cleavage step (i.e., cleaved products) may be combined with oligonucleotides, thereby generating combined products. Certain methods herein comprise generating sets of combined products (e.g., a first set of combined products and a second set of combined products). In some embodiments, a first set of combined products includes target nucleic acids connected to (e.g., hybridized to and/or ligated to) oligonucleotides from a first pool of oligonucleotides. In some embodiments, a second set of combined products includes cleaved products connected to (e.g., hybridized to and/or ligated to) oligonucleotides from a second pool of oligonucleotides.

Target nucleic acids may be combined with oligonucleotides under hybridization conditions, thereby generating hybridization products. The conditions during the combining step are those conditions in which oligonucleotides (e.g., oligonucleotide overhangs) specifically hybridize to target nucleic acids having overhangs or overhang regions that are complementary in sequence and have corresponding lengths with respect to the oligonucleotide overhangs. In some embodiments, corresponding length generally refers to the same length (i.e., the same number of bases in the oligonucleotide overhang and the target nucleic acid overhang). Specific hybridization may be affected or influenced by factors such as the degree of complementarity between the oligonucleotide overhangs and the target nucleic acid overhangs, the length thereof, and the temperature at which the hybridization occurs, which may be informed by melting temperatures (Tm) of the overhangs. Melting temperature generally refers to the temperature at which half of the oligonucleotide overhangs/target nucleic acid overhangs remain hybridized and half of the oligonucleotide overhangs/target nucleic acid overhangs dissociate into single strands. The Tm of a duplex may be experimentally determined or predicted using the following formula: Tm = 81.5 + 16.6(log₁₀[Na+]) + 0.41(fraction G+C) − (60/N), where N is the chain length and [Na+] is less than 1 M.
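The Tm formula above can be evaluated directly. The following minimal sketch implements the formula exactly as written, with the coefficients 81.5, 16.6, 0.41, and 60 taken from the text. The function and parameter names are hypothetical. Because conventions for expressing the G+C term vary, the sketch accepts that term as a caller-supplied value (`gc_term`) rather than assuming a particular scale; the `gc_fraction` helper is offered only as one possible way to obtain it.

```python
import math


def gc_fraction(seq: str) -> float:
    """Return the fraction of G and C bases in a sequence (0.0 to 1.0)."""
    if not seq:
        raise ValueError("sequence must not be empty")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def predicted_tm(na_molar: float, gc_term: float, chain_length: int) -> float:
    """Evaluate Tm = 81.5 + 16.6*log10([Na+]) + 0.41*(G+C term) - 60/N.

    na_molar     : sodium ion concentration in mol/L (formula assumes < 1 M)
    gc_term      : G+C term as used by the caller (assumption: caller decides
                   the scale; the formula in the text says "fraction G+C")
    chain_length : N, the chain length in nucleotides
    """
    if not 0 < na_molar < 1:
        raise ValueError("[Na+] must be greater than 0 and less than 1 M")
    if chain_length <= 0:
        raise ValueError("chain length must be positive")
    return (81.5
            + 16.6 * math.log10(na_molar)
            + 0.41 * gc_term
            - 60.0 / chain_length)


# Example with hypothetical inputs: 0.1 M Na+, G+C term 0.5, N = 20.
tm = predicted_tm(0.1, gc_fraction("GGCCAATT"), 20)
print(round(tm, 3))
```

Empirical formulas of this kind were derived for longer duplexes, so values computed for short overhangs are best treated as rough guides rather than precise predictions.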

In some embodiments, a method herein comprises exposing hybridization products to conditions under which an end of a target nucleic acid is joined to an end of an oligonucleotide species to which it is hybridized. Joining may be achieved by any suitable approach that permits covalent attachment of a target nucleic acid to the oligonucleotide to which it is hybridized. When one end of a target nucleic acid is joined to an end of the oligonucleotide to which it is hybridized, typically two attachment events are conducted: 1) the 3′ end of one strand in the target nucleic acid to the 5′ end of one strand in the oligonucleotide, and 2) the 5′ end of the other strand in the target nucleic acid to the 3′ end of the other strand in the oligonucleotide. When both ends of a target nucleic acid are each joined to an oligonucleotide to which it is hybridized, typically four attachment events are conducted: 1) the 3′ end of one strand in the target nucleic acid to the 5′ end of one strand in the oligonucleotide, 2) the 5′ end of the other strand in the target nucleic acid to the 3′ end of the other strand in the oligonucleotide; and 3) and 4): the same as (1) and (2) for the opposite end of the target nucleic acid attached to another oligonucleotide.

In some embodiments, a method herein comprises contacting hybridization products with an agent comprising a ligase activity under conditions in which an end of a target nucleic acid is covalently linked to an end of an oligonucleotide species to which the target nucleic acid is hybridized. Ligase activity may include, for example, blunt-end ligase activity, nick-sealing ligase activity, sticky end ligase activity, circularization ligase activity, cohesive end ligase activity, DNA ligase activity, and RNA ligase activity. Ligase activity may include ligating a 5′ end of a target nucleic acid to a 3′ end of an oligonucleotide hybridized thereto in a ligation reaction. Suitable reagents (e.g., ligases) and kits for performing ligation reactions are known and available. For example, Instant Sticky-end Ligase Master Mix available from New England Biolabs (Ipswich, MA) may be used. Ligases that may be used include, for example, T4 DNA ligase, T7 DNA Ligase, E. coli DNA Ligase, Electro Ligase®, RNA ligases, T4 RNA ligase 2, SplintR® Ligase, and the like and combinations thereof.

In some embodiments, hybridization products are contacted with a first agent comprising a first ligase activity and a second agent comprising a second ligase activity different than the first ligase activity. For example, the first ligase activity and the second ligase activity independently may be chosen from blunt-end ligase activity, nick-sealing ligase activity, sticky end ligase activity, circularization ligase activity, and cohesive end ligase activity. In some embodiments, certain oligonucleotides have no overhang. Such oligonucleotides may be blunt ended and may be joined (e.g., ligated) to one or more blunt ends of a target nucleic acid.

In some embodiments, a method herein comprises joining target nucleic acids to oligonucleotides via biocompatible attachments. Methods may include, for example, click chemistry or tagging, which include biocompatible reactions useful for joining biomolecules. In some embodiments, an end of each of the oligonucleotides comprises a first chemically reactive moiety and an end of each of the target nucleic acids includes a second chemically reactive moiety. In such embodiments, the first chemically reactive moiety typically is capable of reacting with the second chemically reactive moiety and forming a covalent bond between an oligonucleotide and a target nucleic acid to which the oligonucleotide is hybridized. In some embodiments, a method herein includes contacting target nucleic acids with one or more chemical agents under conditions in which the second chemically reactive moiety is incorporated at an end of each of the target nucleic acids. In some embodiments, a method herein includes exposing hybridization products to conditions in which the first chemically reactive moiety reacts with the second chemically reactive moiety forming a covalent bond between an oligonucleotide and a target nucleic acid to which the oligonucleotide is hybridized. In some embodiments, the first chemically reactive moiety is capable of reacting with the second chemically reactive moiety to form a 1,2,3-triazole between the oligonucleotide and the target nucleic acid to which the oligonucleotide is hybridized. In some embodiments, the first chemically reactive moiety is capable of reacting with the second chemically reactive moiety under conditions comprising copper. The first and second chemically reactive moieties may include any suitable pairings.
For example, the first chemically reactive moiety may be chosen from an azide-containing moiety and 5-octadiynyl deoxyuracil, and the second chemically reactive moiety may be independently chosen from an azide-containing moiety, hexynyl and 5-octadiynyl deoxyuracil. In some embodiments, the azide-containing moiety is N-hydroxysuccinimide (NHS) ester-azide.

Cleavage

In some embodiments, oligonucleotides herein and/or hybridization products (e.g., oligonucleotides herein hybridized to target nucleic acids) are cleaved or sheared prior to, during, or after a method described herein. In some embodiments, oligonucleotides herein and/or hybridization products are cleaved or sheared at a cleavage site. In some embodiments, oligonucleotides herein and/or hybridization products are cleaved or sheared at a cleavage site within a hairpin loop. In some embodiments, oligonucleotides herein and/or hybridization products are cleaved or sheared at a cleavage site at an internal location in an oligonucleotide (e.g., within a duplex region of an oligonucleotide). In some embodiments, circular hybridization products are cleaved or sheared prior to, during, or after a method described herein. In some embodiments, nucleic acids, such as, for example, cellular nucleic acids and/or large fragments (e.g., greater than 500 base pairs in length) are cleaved or sheared prior to, during, or after a method described herein. Large fragments may be referred to as high molecular weight (HMW) nucleic acid or HMW DNA. HMW nucleic acid fragments may include fragments greater than about 500 bp, about 600 bp, about 700 bp, about 800 bp, about 900 bp, about 1000 bp, about 2000 bp, about 3000 bp, about 4000 bp, about 5000 bp, about 10,000 bp, or more. The term “shearing” or “cleavage” generally refers to a procedure or conditions in which a nucleic acid molecule may be severed into two (or more) smaller nucleic acid molecules. Such shearing or cleavage can be sequence specific, base specific, or nonspecific, and can be accomplished by any of a variety of methods, reagents or conditions, including, for example, chemical, enzymatic, and physical (e.g., physical fragmentation).
Sheared or cleaved nucleic acids may have a nominal, average or mean length of about 5 to about 10,000 base pairs, about 100 to about 1,000 base pairs, about 100 to about 500 base pairs, or about 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000 or 9000 base pairs.

Sheared or cleaved nucleic acids can be generated by a suitable method, non-limiting examples of which include physical methods (e.g., shearing, e.g., sonication, ultrasonication, French press, heat, UV irradiation, the like), enzymatic processes (e.g., enzymatic cleavage agents (e.g., a suitable nuclease, a suitable restriction enzyme)), chemical methods (e.g., alkylation, DMS, piperidine, acid hydrolysis, base hydrolysis, heat, the like, or combinations thereof), ultraviolet (UV) light (e.g., at a photo-cleavable site (e.g., comprising a photo-cleavable spacer)), the like or combinations thereof. The average, mean or nominal length of the resulting nucleic acid fragments can be controlled by selecting an appropriate fragment-generating method.

The term “cleavage agent” generally refers to an agent, sometimes a chemical or an enzyme, that can cleave a nucleic acid at one or more specific or non-specific sites. Specific cleavage agents often cleave specifically according to a particular nucleotide sequence at a particular site, which may be referred to as a cleavage site. Cleavage agents may include enzymatic cleavage agents, chemical cleaving agents, and light (e.g., ultraviolet (UV) light).

Examples of enzymatic cleavage agents include without limitation endonucleases; deoxyribonucleases (DNase; e.g., DNase I, II); ribonucleases (RNase; e.g., RNAse A, RNAse E, RNAse F, RNAse H, RNAse III, RNAse L, RNAse P, RNAse PhyM, RNAse T1, RNAse T2, RNAse U2, and RNAse V); endonuclease VIII; CLEAVASE enzyme; TAQ DNA polymerase; E. coli DNA polymerase I; eukaryotic structure-specific endonucleases; murine FEN-1 endonucleases; nicking enzymes; type I, II or III restriction endonucleases (i.e., restriction enzymes) such as Acc I, Aci I, Afl III, Alu I, Alw44 I, Apa I, Asn I, Ava I, Ava II, BamH I, Ban II, Bcl I, Bgl I, Bgl II, Bln I, Bsm I, BssH II, BstE II, BstU I, Cfo I, Cla I, Dde I, Dpn I, Dra I, EclX I, EcoR I, EcoR II, EcoR V, Hae II, Hha I, Hind II, Hind III, Hpa I, Hpa II, Kpn I, Ksp I, Mae II, McrBC, Mlu I, MluN I, Msp I, Nci I, Nco I, Nde I, Nde II, Nhe I, Not I, Nru I, Nsi I, Pst I, Pvu I, Pvu II, Rsa I, Sac I, Sal I, Sau3A I, Sca I, ScrF I, Sfi I, Sma I, Spe I, Sph I, Ssp I, Stu I, Sty I, Swa I, Taq I, Xba I, Xho I; glycosylases (e.g., uracil-DNA glycosylase (UDG), 3-methyladenine DNA glycosylase, 3-methyladenine DNA glycosylase II, pyrimidine hydrate-DNA glycosylase, FaPy-DNA glycosylase, thymine mismatch-DNA glycosylase (e.g., hypoxanthine-DNA glycosylase, uracil DNA glycosylase (UDG), 5-Hydroxymethyluracil DNA glycosylase (HmUDG), 5-Hydroxymethylcytosine DNA glycosylase, or 1,N6-etheno-adenine DNA glycosylase)); exonucleases (e.g., exonuclease I, exonuclease II, exonuclease III, exonuclease IV, exonuclease V, exonuclease VI, exonuclease VII, exonuclease VIII); 5′ to 3′ exonucleases (e.g., exonuclease II); 3′ to 5′ exonucleases (e.g., exonuclease I); poly(A)-specific 3′ to 5′ exonucleases; ribozymes; DNAzymes; and the like and combinations thereof.

In some embodiments, a cleavage site (e.g., a cleavage site within a duplex portion of an oligonucleotide) comprises nucleotides chosen from uracil and deoxyuridine. In some embodiments, a cleavage agent comprises an endonuclease. In some embodiments, a cleavage agent comprises a DNA glycosylase. In some embodiments, cleavage agents comprise an endonuclease and a DNA glycosylase. In some embodiments, cleavage agents comprise a mixture of uracil DNA glycosylase (UDG) and endonuclease VIII.
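A mixture of UDG and endonuclease VIII is generally understood to excise the uracil base and then break the backbone at the resulting abasic site, leaving a one-nucleotide gap. The sketch below models only that net outcome on a single strand written 5′ to 3′, using the letter `U` as a stand-in for a uracil or deoxyuridine position. The function name and example sequence are hypothetical, and end chemistry (e.g., terminal phosphates) is not modeled.

```python
def cleave_at_uracil(strand: str) -> list[str]:
    """Split a 5'-to-3' strand at every U, dropping the U itself.

    Models the net result of glycosylase plus endonuclease treatment:
    each uracil position becomes a one-nucleotide gap, so the strand is
    returned as the fragments on either side. Empty fragments (from a
    terminal or adjacent U) are discarded.
    """
    return [fragment for fragment in strand.upper().split("U") if fragment]


# Toy hairpin strand with a single deoxyuridine in the loop region.
fragments = cleave_at_uracil("ACCGTGATTCUGAATCACGGT")
print(fragments)
```

Applied to a hairpin with a single cleavage site between the first and second polynucleotides, this yields two separate strands, consistent with the conversion of a hairpin into a two-stranded product described earlier.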

In some embodiments, a cleavage site comprises a restriction enzyme recognition site. In some embodiments, a cleavage agent comprises a restriction enzyme. In some embodiments, a cleavage site comprises a rare-cutter restriction enzyme recognition site (e.g., a NotI recognition sequence). In some embodiments, a cleavage agent comprises a rare-cutter enzyme (e.g., a rare-cutter restriction enzyme). A rare-cutter enzyme generally refers to a restriction enzyme with a recognition sequence which occurs only rarely in a genome (e.g., a human genome). An example is NotI, which cuts after the first GC of a 5′-GCGGCCGC-3′ sequence. Restriction enzymes with seven and eight base pair recognition sequences often are considered as rare-cutter enzymes.
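The NotI example above can be sketched as a simple sequence search. The code below locates each 5′-GCGGCCGC-3′ occurrence in a strand and places the cut after the first GC, as stated in the text. Only the cut on the strand as written is modeled; the cut on the complementary strand and the resulting overhang are not represented. Function names and the example sequence are hypothetical.

```python
NOTI_SITE = "GCGGCCGC"   # recognition sequence given in the text
CUT_OFFSET = 2           # cut falls after the first "GC" of the site


def noti_cut_positions(seq: str) -> list[int]:
    """Return 0-based indices at which the strand is cut (cut is before index)."""
    seq = seq.upper()
    positions = []
    start = seq.find(NOTI_SITE)
    while start != -1:
        positions.append(start + CUT_OFFSET)
        # Advance by one so overlapping occurrences are also found.
        start = seq.find(NOTI_SITE, start + 1)
    return positions


def digest_strand(seq: str) -> list[str]:
    """Split the strand as written at every NotI cut position."""
    seq = seq.upper()
    bounds = [0] + noti_cut_positions(seq) + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


example = "AAAGCGGCCGCTTT"
print(noti_cut_positions(example))
print(digest_strand(example))
```

Because an eight-base recognition sequence is expected by chance only about once every 4⁸ (65,536) bases in random sequence, most short inputs will return no cut positions, which is the property that makes such enzymes rare cutters.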

Cleavage methods and procedures for selecting restriction enzymes for cutting DNA at specific sites are well known to the skilled artisan. For example, many suppliers of restriction enzymes provide information on conditions and types of DNA sequences cut by specific restriction enzymes, including New England BioLabs, Pro-Mega Biochems, Boehringer-Mannheim, and the like. Enzymes often are used under conditions that will enable cleavage of the DNA with about 95%-100% efficiency, preferably with about 98%-100% efficiency.

In some embodiments, a cleavage site comprises one or more ribonucleic acid (RNA) nucleotides. In some embodiments, a cleavage site comprises a single stranded portion comprising one or more RNA nucleotides. In some embodiments, the single stranded portion is flanked by duplex portions. In some embodiments, the single stranded portion is a hairpin loop. In some embodiments, a cleavage site comprises one RNA nucleotide. In some embodiments, a cleavage site comprises two RNA nucleotides. In some embodiments, a cleavage site comprises three RNA nucleotides. In some embodiments, a cleavage site comprises four RNA nucleotides. In some embodiments, a cleavage site comprises five RNA nucleotides. In some embodiments, a cleavage site comprises more than five RNA nucleotides. In some embodiments, a cleavage site comprises one or more RNA nucleotides chosen from adenine (A), cytosine (C), guanine (G), and uracil (U). In some embodiments, a cleavage site comprises one or more RNA nucleotides chosen from adenine (A), cytosine (C), and guanine (G). In some embodiments, a cleavage site comprises no uracil (U). In some embodiments, a cleavage site comprises one or more RNA nucleotides comprising guanine (G). In some embodiments, a cleavage site comprises one or more RNA nucleotides consisting of guanine (G). In some embodiments, a cleavage site comprises one or more RNA nucleotides comprising cytosine (C). In some embodiments, a cleavage site comprises one or more RNA nucleotides consisting of cytosine (C). In some embodiments, a cleavage site comprises one or more RNA nucleotides comprising adenine (A). In some embodiments, a cleavage site comprises one or more RNA nucleotides consisting of adenine (A). In some embodiments, a cleavage site comprises one or more RNA nucleotides consisting of adenine (A), cytosine (C), and guanine (G). In some embodiments, a cleavage site comprises one or more RNA nucleotides consisting of adenine (A) and cytosine (C). In some embodiments, a cleavage site comprises one or more RNA nucleotides consisting of adenine (A) and guanine (G). In some embodiments, a cleavage site comprises one or more RNA nucleotides consisting of cytosine (C) and guanine (G). In some embodiments, a cleavage agent comprises a ribonuclease (RNAse). In some embodiments, an RNAse is an endoribonuclease. An RNAse may be chosen from one or more of RNAse A, RNAse E, RNAse F, RNAse H, RNAse III, RNAse L, RNAse P, RNAse PhyM, RNAse T1, RNAse T2, RNAse U2, and RNAse V.

In some embodiments, a cleavage site comprises a photo-cleavable spacer or photo-cleavable modification. Photo-cleavable modifications may contain, for example, a photolabile functional group that is cleavable by ultraviolet (UV) light of specific wavelength (e.g., 300-350 nm). An example photo-cleavable spacer (available from Integrated DNA Technologies; product no. 1707) is a 10-atom linker arm that can only be cleaved when exposed to UV light within the appropriate spectral range. An oligonucleotide comprising a photo-cleavable spacer can have a 5′ phosphate group that is available for subsequent ligase reactions. Photo-cleavable spacers can be placed between DNA bases or between an oligo and a terminal modification (e.g., a fluorophore). In such embodiments, ultraviolet (UV) light may be considered as a cleavage agent.

In some embodiments, a cleavage site comprises a diol. For example, a cleavage site may comprise a vicinal diol incorporated in a 5′ to 5′ linkage. Cleavage sites comprising a diol may be chemically cleaved, for example, using a periodate. In some embodiments, a cleavage site comprises a blunt end restriction enzyme recognition site. Cleavage sites comprising a blunt end restriction enzyme recognition site may be cleaved by a blunt end restriction enzyme.

Nick Seal and Fill-In

In some embodiments, a method herein comprises performing a nick seal reaction (e.g., using a DNA ligase or other suitable enzyme, and, in certain instances, a kinase adapted to 5′ phosphorylate nucleic acids (e.g., a polynucleotide kinase (PNK))). In some embodiments, a method herein comprises performing a fill-in reaction. For example, when oligonucleotides are present as duplexes, some or all of the duplexes may include an overhang at the end of the duplex opposite the end that hybridizes to the nucleic acids. When such duplex overhangs exist, subsequent to the combining, a method herein may further include filling in the overhangs formed by the duplexes. In some embodiments, a fill-in reaction is performed to generate a blunt-ended hybridization product. Any suitable reagent for carrying out a fill-in reaction may be used. Polymerases suitable for performing fill-in reactions include, e.g., DNA polymerase I, large (Klenow) fragment, Bacillus stearothermophilus (Bst) DNA polymerase, and the like. In some embodiments, a strand displacing polymerase is used (e.g., Bst DNA polymerase).

Exonuclease Treatment

In some embodiments, nucleic acid (e.g., hybridization products; circularized hybridization products) is treated with an exonuclease. Exonucleases are enzymes that work by cleaving nucleotides one at a time from the end of a polynucleotide chain through a hydrolyzing reaction that breaks phosphodiester bonds at either the 3′ or the 5′ end. Exonucleases include, for example, DNAses, RNAses (e.g., RNAse H), 5′ to 3′ exonucleases (e.g., exonuclease II), 3′ to 5′ exonucleases (e.g., exonuclease I), and poly(A)-specific 3′ to 5′ exonucleases. In some embodiments, hybridization products are treated with an exonuclease to remove contaminating nucleic acids such as, for example, single stranded oligonucleotides or nucleic acid fragments. In some embodiments, circularized hybridization products are treated with an exonuclease to remove any non-circularized hybridization products, non-hybridized oligonucleotides, non-hybridized target nucleic acids, oligonucleotide dimers, and the like and combinations thereof.

Second Pool of Oligonucleotides

Certain methods described herein comprise combining a target nucleic acid with a first pool of oligonucleotides (e.g., oligonucleotides comprising overhangs capable of hybridizing to target nucleic acid overhangs as described herein), cleaving the combined products to generate cleaved products, and combining the cleaved products with a second pool of oligonucleotides. Oligonucleotides in the second pool may comprise any feature described herein for oligonucleotides. However, oligonucleotides in the second pool generally do not comprise overhangs that are complementary to native overhangs in target nucleic acids, and generally do not comprise an overhang identification sequence.

In some embodiments, a method herein comprises attaching (e.g., annealing, hybridizing, ligating) an oligonucleotide from a second pool to at least one end of a cleaved target nucleic acid fragment (cleaved product). Generally, an oligonucleotide from a second pool attaches to a cleaved target nucleic acid fragment at a cleaved end and does not attach to a native end. In some embodiments, a cleaved target nucleic acid fragment undergoes end repair comprising one or more of blunt-end repair, 3′ to 5′ exonuclease treatment, 5′ fill-in, A-tailing, and 5′ phosphorylation, prior to combining with an oligonucleotide from a second pool. In some embodiments, a method herein comprises adding one or more unpaired nucleotides (e.g., A tail) to one or both ends of a cleaved product (i.e., at the cleaved end or ends). In some embodiments, an oligonucleotide from a second pool comprises one or more nucleotides (e.g., at the first end) that are complementary to the one or more nucleotides added to a cleaved product. In some embodiments, an end of an oligonucleotide from a second pool (e.g., a first end) is capable of being covalently linked to an end of a cleaved product to which the oligonucleotide is attached. In some embodiments, the 3′ end of an oligonucleotide strand is capable of being covalently linked to the 5′ end (e.g., phosphorylated 5′ end) of a strand in the cleaved product to which the oligonucleotide is attached.

An oligonucleotide from a second pool may comprise a primer binding domain. A primer binding domain on an oligonucleotide from a second pool may be different from a primer binding domain on an oligonucleotide from a first pool. The primer binding domain may comprise any suitable primer binding sequence. In some embodiments, a primer binding domain comprises a P5 primer binding sequence. In some embodiments, a primer binding domain comprises a P7 primer binding sequence.

An oligonucleotide from a second pool may comprise a blunt end (e.g., at a first end), or may comprise a short (e.g., 1 bp, 2 bp, 3 bp) overhang at the first end. For example, an oligonucleotide from a second pool may comprise a single T, A, C, G or U overhang at the first end. In some embodiments, an oligonucleotide from a second pool comprises a single T overhang. Typically the overhang (e.g., T overhang) is on the 3′ end of a strand at the first end.

An oligonucleotide from a second pool may comprise a phosphorothioate backbone modification (e.g., a phosphorothioate bond between the last two nucleotides on a strand). In some embodiments, an oligonucleotide from a second pool comprises a phosphorothioate backbone modification on a strand before an overhang (e.g., 3′ T overhang). An oligonucleotide from a second pool may comprise one or more modified nucleotides, such as, for example, any modified nucleotide described herein. In some embodiments, an oligonucleotide from a second pool comprises a blocked nucleotide. An oligonucleotide from a second pool may be phosphorylated. An oligonucleotide from a second pool may be phosphorylated at the first end. Typically, an oligonucleotide from a second pool is phosphorylated at the 5′ end of a strand at the first end.

Certain methods described herein comprise use of truncated oligonucleotides. In some embodiments, a second pool of oligonucleotides comprises truncated oligonucleotides. Truncated oligonucleotides may be referred to herein as specialized oligonucleotides, specialized adapters (e.g., specialized P5 adapters), shorty oligonucleotides, shorty adapters (e.g., shorty P5 adapters), and variations thereof. A truncated oligonucleotide generally comprises two nucleic acid strands (i.e., a first strand and a second strand), where one strand is shorter than the other strand. In some embodiments, the first strand is shorter than the second strand. In some embodiments, the first strand and the second strand are complementary at one end of the oligonucleotide (e.g., a first end) and the second strand comprises a single strand at the other end of the oligonucleotide (e.g., a second end). A truncated oligonucleotide may be designed such that the complement to the long strand is long enough to stay annealed, but is too short to be amplified (e.g., during index PCR). A truncated oligonucleotide may comprise any feature described herein for oligonucleotides. However, truncated oligonucleotides generally do not comprise overhangs that are complementary to native overhangs in target nucleic acids, and generally do not comprise an overhang identification sequence.

A truncated oligonucleotide may comprise an oligonucleotide identification sequence (e.g., barcode) specific to the truncated oligonucleotide. An oligonucleotide identification sequence may be used to identify a nucleic acid fragment end that is ligated to a truncated oligonucleotide. In some instances, an oligonucleotide identification sequence may be used to distinguish a nucleic acid fragment end that is ligated to a truncated oligonucleotide versus a nucleic acid fragment end that is ligated to a non-truncated oligonucleotide (e.g., an overhang oligonucleotide described herein). In some instances, an oligonucleotide identification sequence may be used to identify a non-native nucleic acid fragment end (e.g., a nucleic acid fragment end generated by shearing). In some embodiments, a truncated oligonucleotide comprises an oligonucleotide identification sequence that is about 5 bp to about 10 bp in length. For example, a truncated oligonucleotide may comprise an oligonucleotide identification sequence that is about 5 bp, 6 bp, 7 bp, 8 bp, 9 bp, or 10 bp in length. In some embodiments, a truncated oligonucleotide comprises an oligonucleotide identification sequence that is 8 bp in length.
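One way an oligonucleotide identification sequence might be used to distinguish fragment ends is sketched below in Python. The two 8-nucleotide barcodes, the lookup table, and the assumption that the identification sequence occupies the first eight bases of a sequencing read are all hypothetical choices made for illustration; no barcode sequences or read layouts are specified here.

```python
from typing import Dict

BARCODE_LENGTH = 8  # matches the 8 bp embodiment; other lengths are possible

# Hypothetical identification sequences, invented for this sketch only.
ADAPTER_BARCODES: Dict[str, str] = {
    "ACGTACGT": "truncated",  # end ligated to a truncated oligonucleotide
    "TGCATGCA": "overhang",   # end ligated to an overhang oligonucleotide
}


def classify_fragment_end(read: str) -> str:
    """Classify a fragment end by the barcode at the start of the read.

    Assumption: the read begins with the identification sequence.
    Reads with an unrecognized or incomplete barcode are "unknown".
    """
    barcode = read[:BARCODE_LENGTH].upper()
    return ADAPTER_BARCODES.get(barcode, "unknown")
```

An exact-match lookup is the simplest design. A practical implementation would likely need to tolerate sequencing errors in the barcode, which this sketch does not attempt.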

A truncated oligonucleotide may comprise a primer binding domain. Generally, the primer binding domain is on the longer strand (e.g., the second strand). The primer binding domain may comprise any suitable primer binding sequence. In some embodiments, a primer binding domain comprises a P5 primer binding sequence. In some embodiments, a primer binding domain comprises a P7 primer binding sequence. Typically, the shorter strand (e.g., the first strand) comprises no primer binding domain.

A truncated oligonucleotide may comprise a blunt end (e.g., at the first end), or may comprise a short (e.g., 1 bp, 2 bp, 3 bp) overhang at the first end. For example, a truncated oligonucleotide may comprise a single T, A, C, G or U overhang at the first end. In some embodiments, a truncated oligonucleotide comprises a single T overhang. Typically the overhang (e.g., T overhang) is on the 3′ end of the second strand.

A truncated oligonucleotide may comprise a phosphorothioate backbone modification (e.g., a phosphorothioate bond between the last two nucleotides on a strand). In some embodiments, a truncated oligonucleotide comprises a phosphorothioate backbone modification on the second strand. In some embodiments, a truncated oligonucleotide comprises a phosphorothioate backbone modification on the second strand before an overhang (e.g., 3′ T overhang).

A truncated oligonucleotide may comprise one or more modified nucleotides, such as, for example, any modified nucleotide described herein. In some embodiments, a truncated oligonucleotide comprises a blocked nucleotide (e.g., a nucleotide comprising a C3 spacer). In some embodiments, a truncated oligonucleotide comprises a blocked nucleotide on the second strand. Typically, the blocked nucleotide is on the 5′ end of the second strand. A truncated oligonucleotide may be phosphorylated. A truncated oligonucleotide may be phosphorylated at the first end. Typically, a truncated oligonucleotide is phosphorylated at the 5′ end of the first strand.

Samples

Provided herein are methods and compositions for processing and/or analyzing nucleic acid. Nucleic acid or a nucleic acid mixture utilized in methods and compositions described herein may be isolated from a sample obtained from a subject (e.g., a test subject). A subject can be any living or non-living organism, including but not limited to a human, a non-human animal, a plant, a bacterium, a fungus, a protist or a pathogen. Any human or non-human animal can be selected, and may include, for example, mammal, reptile, avian, amphibian, fish, ungulate, ruminant, bovine (e.g., cattle), equine (e.g., horse), caprine and ovine (e.g., sheep, goat), swine (e.g., pig), camelid (e.g., camel, llama, alpaca), monkey, ape (e.g., gorilla, chimpanzee), ursid (e.g., bear), poultry, dog, cat, mouse, rat, dolphin, whale and shark. A subject may be a male or female (e.g., woman, a pregnant woman). A subject may be any age (e.g., an embryo, a fetus, an infant, a child, an adult). A subject may be a cancer patient, a patient suspected of having cancer, a patient in remission, a patient with a family history of cancer, and/or a subject obtaining a cancer screen. A subject may be a patient having an infection or infectious disease or infected with a pathogen (e.g., bacteria, virus, fungus, protozoa, and the like), a patient suspected of having an infection or infectious disease or being infected with a pathogen, a patient recovering from an infection, infectious disease, or pathogenic infection, a patient with a history of infections, infectious disease, pathogenic infections, and/or a subject obtaining an infectious disease or pathogen screen. A subject may be a transplant recipient. A subject may be a patient undergoing a microbiome analysis. In some embodiments, a test subject is a female. In some embodiments, a test subject is a human female. In some embodiments, a test subject is a male. In some embodiments, a test subject is a human male.

A nucleic acid sample may be isolated or obtained from any type of suitable biological specimen or sample (e.g., a test sample). A nucleic acid sample may be isolated or obtained from a single cell, a plurality of cells (e.g., cultured cells), cell culture media, conditioned media, a tissue, an organ, or an organism (e.g., bacteria, yeast, or the like). In some embodiments, a nucleic acid sample is isolated or obtained from a cell(s), tissue, organ, and/or the like of an animal (e.g., an animal subject). In some embodiments, a nucleic acid sample is isolated or obtained from a source such as bacteria, yeast, insects (e.g., drosophila), mammals, amphibians (e.g., frogs (e.g., Xenopus)), viruses, plants, or any other mammalian or non-mammalian nucleic acid sample source.

A nucleic acid sample may be isolated or obtained from an extant organism or animal. In some instances, a nucleic acid sample may be isolated or obtained from an extinct (or “ancient”) organism or animal (e.g., an extinct mammal; an extinct mammal from the genus Homo). In some instances, a nucleic acid sample may be obtained as part of a forensics analysis. In some instances, a nucleic acid sample may be obtained as part of a diagnostic analysis.

A sample or test sample may be any specimen that is isolated or obtained from a subject or part thereof (e.g., a human subject, a pregnant female, a cancer patient, a patient having an infection or infectious disease, a transplant recipient, a fetus, a tumor, an infected organ or tissue, a transplanted organ or tissue, a microbiome). A sample sometimes is from a pregnant female subject bearing a fetus at any stage of gestation (e.g., first, second or third trimester for a human subject), and sometimes is from a post-natal subject. A sample sometimes is from a pregnant subject bearing a fetus that is euploid for all chromosomes, and sometimes is from a pregnant subject bearing a fetus having a chromosome aneuploidy (e.g., one, three (i.e., trisomy (e.g., T21, T18, T13)), or four copies of a chromosome) or other genetic variation. Non-limiting examples of specimens include fluid or tissue from a subject, including, without limitation, blood or a blood product (e.g., serum, plasma, or the like), umbilical cord blood, chorionic villi, amniotic fluid, cerebrospinal fluid, spinal fluid, lavage fluid (e.g., bronchoalveolar, gastric, peritoneal, ductal, ear, arthroscopic), biopsy sample (e.g., from pre-implantation embryo; cancer biopsy), celocentesis sample, cells (blood cells, placental cells, embryo or fetal cells, fetal nucleated cells or fetal cellular remnants, normal cells, abnormal cells (e.g., cancer cells)) or parts thereof (e.g., mitochondrial, nucleus, extracts, or the like), washings of female reproductive tract, urine, feces, sputum, saliva, nasal mucous, prostate fluid, lavage, semen, lymphatic fluid, bile, tears, sweat, breast milk, breast fluid, the like or combinations thereof. In some embodiments, a biological sample is a cervical swab from a subject. A fluid or tissue sample from which nucleic acid is extracted may be acellular (e.g., cell-free). In some embodiments, a fluid or tissue sample may contain cellular elements or cellular remnants. In some embodiments, fetal cells or cancer cells may be included in the sample.

A sample can be a liquid sample. A liquid sample can comprise extracellular nucleic acid (e.g., circulating cell-free DNA). Non-limiting examples of liquid samples include blood or a blood product (e.g., serum, plasma, or the like), urine, biopsy sample (e.g., liquid biopsy for the detection of cancer), a liquid sample described above, the like or combinations thereof. In certain embodiments, a sample is a liquid biopsy, which generally refers to an assessment of a liquid sample from a subject for the presence, absence, progression or remission of a disease (e.g., cancer). A liquid biopsy can be used in conjunction with, or as an alternative to, a solid biopsy (e.g., tumor biopsy). In certain instances, extracellular nucleic acid is analyzed in a liquid biopsy.

In some embodiments, a biological sample may be blood, plasma or serum. The term “blood” encompasses whole blood, blood product or any fraction of blood, such as serum, plasma, buffy coat, or the like as conventionally defined. Blood or fractions thereof often comprise nucleosomes. Nucleosomes comprise nucleic acids and are sometimes cell-free or intracellular. Blood also comprises buffy coats. Buffy coats are sometimes isolated by utilizing a ficoll gradient. Buffy coats can comprise white blood cells (e.g., leukocytes, T-cells, B-cells, platelets, and the like). Blood plasma refers to the fraction of whole blood resulting from centrifugation of blood treated with anticoagulants. Blood serum refers to the watery portion of fluid remaining after a blood sample has coagulated. Fluid or tissue samples often are collected in accordance with standard protocols that hospitals or clinics generally follow. For blood, an appropriate amount of peripheral blood (e.g., between 3 and 40 milliliters, between 5 and 50 milliliters) often is collected and can be stored according to standard procedures prior to or after preparation.

An analysis of nucleic acid found in a subject's blood may be performed using, e.g., whole blood, serum, or plasma. An analysis of fetal DNA found in maternal blood, for example, may be performed using, e.g., whole blood, serum, or plasma. An analysis of tumor or cancer DNA found in a patient's blood, for example, may be performed using, e.g., whole blood, serum, or plasma. An analysis of pathogen DNA found in a patient's blood, for example, may be performed using, e.g., whole blood, serum, or plasma. An analysis of transplant DNA found in a transplant recipient's blood, for example, may be performed using, e.g., whole blood, serum, or plasma. Methods for preparing serum or plasma from blood obtained from a subject (e.g., a maternal subject; patient; cancer patient) are known. For example, a subject's blood (e.g., a pregnant woman's blood; patient's blood; cancer patient's blood) can be placed in a tube containing EDTA or a specialized commercial product such as Vacutainer SST (Becton Dickinson, Franklin Lakes, N.J.) to prevent blood clotting, and plasma can then be obtained from whole blood through centrifugation. Serum may be obtained with or without centrifugation following blood clotting. If centrifugation is used then it is typically, though not exclusively, conducted at an appropriate speed, e.g., 1,500-3,000 times g. Plasma or serum may be subjected to additional centrifugation steps before being transferred to a fresh tube for nucleic acid extraction. In addition to the acellular portion of the whole blood, nucleic acid may also be recovered from the cellular fraction, enriched in the buffy coat portion, which can be obtained following centrifugation of a whole blood sample from the subject and removal of the plasma.
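Because centrifugation speed is stated above as a multiple of g, it can be helpful to convert that value to a rotor speed. The Python sketch below uses the standard relation RCF = 1.118 × 10⁻⁵ × r × N², where r is the rotor radius in centimeters and N is the speed in revolutions per minute. The rotor radius is not given in this disclosure, so any radius passed to these functions is an assumption, and the function names are illustrative.

```python
import math

# Standard conversion constant for radius in centimeters and speed in rpm.
RCF_CONSTANT = 1.118e-5


def rcf_from_rpm(rpm: float, radius_cm: float) -> float:
    """Relative centrifugal force (times g) for a given rotor speed."""
    return RCF_CONSTANT * radius_cm * rpm ** 2


def rpm_from_rcf(rcf: float, radius_cm: float) -> float:
    """Rotor speed in rpm needed to reach a given force (times g)."""
    if rcf < 0 or radius_cm <= 0:
        raise ValueError("rcf must be non-negative and radius positive")
    return math.sqrt(rcf / (RCF_CONSTANT * radius_cm))
```

For a hypothetical 10 cm rotor radius, `rpm_from_rcf(1500, 10)` and `rpm_from_rcf(3000, 10)` give the speeds bracketing the 1,500-3,000 times g range.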

A sample may be a tumor nucleic acid sample (i.e., a nucleic acid sample isolated from a tumor). The term “tumor” generally refers to neoplastic cell growth and proliferation, whether malignant or benign, and may include pre-cancerous and cancerous cells and tissues. The terms “cancer” and “cancerous” generally refer to the physiological condition in mammals that is typically characterized by unregulated cell growth/proliferation. Examples of cancer include, but are not limited to, carcinoma, lymphoma, blastoma, sarcoma, leukemia, squamous cell cancer, small-cell lung cancer, non-small cell lung cancer, adenocarcinoma of the lung, squamous carcinoma of the lung, cancer of the peritoneum, hepatocellular cancer, gastrointestinal cancer, pancreatic cancer, glioblastoma, cervical cancer, ovarian cancer, liver cancer, bladder cancer, hepatoma, breast cancer, colon cancer, colorectal cancer, endometrial or uterine carcinoma, salivary gland carcinoma, kidney cancer, prostate cancer, vulval cancer, thyroid cancer, hepatic carcinoma, various types of head and neck cancer, and the like.

A sample may be heterogeneous. For example, a sample may include more than one cell type and/or one or more nucleic acid species. In some instances, a sample may include (i) fetal cells and maternal cells, (ii) cancer cells and non-cancer cells, and/or (iii) pathogenic cells and host cells. In some instances, a sample may include (i) cancer and non-cancer nucleic acid, (ii) pathogen and host nucleic acid, (iii) fetal derived and maternal derived nucleic acid, and/or more generally, (iv) mutated and wild-type nucleic acid. In some instances, a sample may include a minority nucleic acid species and a majority nucleic acid species, as described in further detail below. In some instances, a sample may include cells and/or nucleic acid from a single subject or may include cells and/or nucleic acid from multiple subjects.

Nucleic Acid

Provided herein are methods and compositions for processing and/or analyzing nucleic acid. The terms nucleic acid(s), nucleic acid molecule(s), nucleic acid fragment(s), target nucleic acid(s), nucleic acid template(s), template nucleic acid(s), nucleic acid target(s), polynucleotide(s), polynucleotide fragment(s), target polynucleotide(s), polynucleotide target(s), and the like may be used interchangeably throughout the disclosure. The terms refer to nucleic acids of any composition, such as DNA (e.g., complementary DNA (cDNA; synthesized from any RNA or DNA of interest), genomic DNA (gDNA), genomic DNA fragments, mitochondrial DNA (mtDNA), recombinant DNA (e.g., plasmid DNA), and the like), RNA (e.g., messenger RNA (mRNA), short inhibitory RNA (siRNA), ribosomal RNA (rRNA), transfer RNA (tRNA), microRNA, transacting small interfering RNA (ta-siRNA), natural small interfering RNA (nat-siRNA), small nucleolar RNA (snoRNA), small nuclear RNA (snRNA), long non-coding RNA (lncRNA), non-coding RNA (ncRNA), transfer-messenger RNA (tmRNA), precursor messenger RNA (pre-mRNA), small Cajal body-specific RNA (scaRNA), piwi-interacting RNA (piRNA), endoribonuclease-prepared siRNA (esiRNA), small temporal RNA (stRNA), signal recognition RNA, telomere RNA, RNA highly expressed by a fetus or placenta, and the like), and/or DNA or RNA analogs (e.g., containing base analogs, sugar analogs and/or a non-native backbone and the like), RNA/DNA hybrids and polyamide nucleic acids (PNAs), all of which can be in single- or double-stranded form, and unless otherwise limited, can encompass known analogs of natural nucleotides that can function in a similar manner as naturally occurring nucleotides. A nucleic acid may be, or may be from, a plasmid, phage, virus, bacterium, autonomously replicating sequence (ARS), mitochondria, centromere, artificial chromosome, chromosome, or other nucleic acid able to replicate or be replicated in vitro or in a host cell, a cell, a cell nucleus or cytoplasm of a cell in certain embodiments. A template nucleic acid in some embodiments can be from a single chromosome (e.g., a nucleic acid sample may be from one chromosome of a sample obtained from a diploid organism). Unless specifically limited, the term encompasses nucleic acids containing known analogs of natural nucleotides that have similar binding properties as the reference nucleic acid and are metabolized in a manner similar to naturally occurring nucleotides. Unless otherwise indicated, a particular nucleic acid sequence also implicitly encompasses conservatively modified variants thereof (e.g., degenerate codon substitutions), alleles, orthologs, single nucleotide polymorphisms (SNPs), and complementary sequences as well as the sequence explicitly indicated. Specifically, degenerate codon substitutions may be achieved by generating sequences in which the third position of one or more selected (or all) codons is substituted with mixed-base and/or deoxyinosine residues. The term nucleic acid is used interchangeably with locus, gene, cDNA, and mRNA encoded by a gene. The term also may include, as equivalents, derivatives, variants and analogs of RNA or DNA synthesized from nucleotide analogs, single-stranded (“sense” or “antisense,” “plus” strand or “minus” strand, “forward” reading frame or “reverse” reading frame) and double-stranded polynucleotides. The term “gene” refers to a section of DNA involved in producing a polypeptide chain; and generally includes regions preceding and following the coding region (leader and trailer) involved in the transcription/translation of the gene product and the regulation of the transcription/translation, as well as intervening sequences (introns) between individual coding regions (exons). A nucleotide or base generally refers to the purine and pyrimidine molecular units of nucleic acid (e.g., adenine (A), thymine (T), guanine (G), and cytosine (C)). For RNA, the base thymine is replaced with uracil. Nucleic acid length or size may be expressed as a number of bases.

Target nucleic acids may be any nucleic acids of interest. Nucleic acids may be polymers of any length composed of deoxyribonucleotides (i.e., DNA bases), ribonucleotides (i.e., RNA bases), or combinations thereof, e.g., 10 bases or longer, 20 bases or longer, 50 bases or longer, 100 bases or longer, 200 bases or longer, 300 bases or longer, 400 bases or longer, 500 bases or longer, 1000 bases or longer, 2000 bases or longer, 3000 bases or longer, 4000 bases or longer, 5000 bases or longer. In certain aspects, nucleic acids are polymers composed of deoxyribonucleotides (i.e., DNA bases), ribonucleotides (i.e., RNA bases), or combinations thereof, e.g., 10 bases or less, 20 bases or less, 50 bases or less, 100 bases or less, 200 bases or less, 300 bases or less, 400 bases or less, 500 bases or less, 1000 bases or less, 2000 bases or less, 3000 bases or less, 4000 bases or less, or 5000 bases or less.

Nucleic acid may be single or double stranded. Single stranded DNA, for example, can be generated by denaturing double stranded DNA by heating or by treatment with alkali. In certain embodiments, nucleic acid is in a D-loop structure, formed by strand invasion of a duplex DNA molecule by an oligonucleotide or a DNA-like molecule such as peptide nucleic acid (PNA). D-loop formation can be facilitated by addition of E. coli RecA protein and/or by alteration of salt concentration, for example, using methods known in the art.

Nucleic acid (e.g., nucleic acid targets, oligonucleotides, overhangs) may be described herein as being complementary to another nucleic acid or having a complementarity region. The terms “complementary” or “complementarity” as used herein refer to a nucleotide sequence that base-pairs by non-covalent bonds to a region of a nucleic acid (e.g., target). In the canonical Watson-Crick base pairing, adenine (A) forms a base pair with thymine (T), and guanine (G) pairs with cytosine (C) in DNA. In RNA, thymine is replaced by uracil (U). As such, A is complementary to T and G is complementary to C. In RNA, A is complementary to U and vice versa. Typically, “complementary” or “complementarity” refers to a nucleotide sequence that is at least partially complementary. These terms may also encompass duplexes that are fully complementary such that every nucleotide in one strand is complementary to every nucleotide in the other strand in corresponding positions.

In certain instances, a nucleotide sequence may be partially complementary to a target, in which case not every nucleotide is complementary to the nucleotide at the corresponding position in the target nucleic acid. For example, an oligonucleotide overhang may be perfectly (i.e., 100%) complementary to a target nucleic acid overhang, or an oligonucleotide overhang may share some degree of complementarity which is less than perfect (e.g., 70%, 75%, 85%, 90%, 95%, 99%).
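A degree of complementarity between two overhangs can be expressed as the percentage of positions that form canonical Watson-Crick pairs. The Python sketch below is one way to compute it, assuming both overhangs are DNA, are written 5′ to 3′, have equal length, and pair in antiparallel orientation. The function name and these constraints are assumptions made for illustration, and non-canonical pairs are simply counted as mismatches.

```python
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def percent_complementarity(overhang_a: str, overhang_b: str) -> float:
    """Percent of positions forming A-T or G-C pairs.

    Both overhangs are given 5' to 3'. Because strands pair in
    antiparallel orientation, overhang_b is reversed before comparison.
    """
    a = overhang_a.upper()
    b = overhang_b.upper()
    if not a or len(a) != len(b):
        raise ValueError("overhangs must be non-empty and of equal length")
    paired = sum(
        1
        for base_a, base_b in zip(a, reversed(b))
        if DNA_COMPLEMENT.get(base_a) == base_b
    )
    return 100.0 * paired / len(a)
```

For instance, the self-complementary overhang ACGT scores 100% against itself, while AAAA against TTTA scores 75% because one of the four positions fails to pair.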

In some embodiments, nucleic acids in a mixture of nucleic acids are analyzed. A mixture of nucleic acids can comprise two or more nucleic acid species having the same or different nucleotide sequences, different lengths, different origins (e.g., genomic origins, fetal vs. maternal origins, cell or tissue origins, cancer vs. non-cancer origin, tumor vs. non-tumor origin, host vs. pathogen, host vs. transplant, host vs. microbiome, sample origins, subject origins, and the like), different overhang lengths, different overhang types (e.g., 5′ overhangs, 3′ overhangs, no overhangs), or combinations thereof. Nucleic acid provided for processes described herein may contain nucleic acid from one sample or from two or more samples (e.g., from 1 or more, 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 11 or more, 12 or more, 13 or more, 14 or more, 15 or more, 16 or more, 17 or more, 18 or more, 19 or more, or 20 or more samples).

In some embodiments, target nucleic acids comprise degraded DNA. Degraded DNA may be referred to as low-quality DNA or highly degraded DNA. Degraded DNA may be highly fragmented, and may include damage such as base analogs and abasic sites subject to miscoding lesions and/or intermolecular crosslinking. For example, sequencing errors resulting from deamination of cytosine residues may be present in certain sequences obtained from degraded DNA (e.g., miscoding of C to T and G to A).
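As an illustration of the miscoding pattern mentioned above, the following Python sketch tallies C-to-T and G-to-A substitutions between a reference sequence and a read. It assumes a gap-free alignment of equal-length strings; the function name `deamination_mismatches` is hypothetical.

```python
from collections import Counter


def deamination_mismatches(reference: str, read: str) -> Counter:
    """Count C-to-T, G-to-A, and all other substitutions in a gap-free alignment."""
    if len(reference) != len(read):
        raise ValueError("reference and read must be the same length")
    counts = Counter({"C>T": 0, "G>A": 0, "other": 0})
    for ref_base, read_base in zip(reference.upper(), read.upper()):
        if ref_base == read_base:
            continue  # matching position, nothing to record
        if ref_base == "C" and read_base == "T":
            counts["C>T"] += 1
        elif ref_base == "G" and read_base == "A":
            counts["G>A"] += 1
        else:
            counts["other"] += 1
    return counts


counts = deamination_mismatches("CCGGAT", "TCGAAA")
print(counts["C>T"], counts["G>A"], counts["other"])  # 1 1 1
```

An excess of C>T and G>A counts relative to other substitutions would be consistent with cytosine deamination, although this sketch does not attempt to distinguish damage from true sequence variation.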

Nucleic acid may be derived from one or more sources (e.g., biological sample, blood, cells, serum, plasma, buffy coat, urine, lymphatic fluid, skin, soil, and the like) by methods known in the art. Any suitable method can be used for isolating, extracting and/or purifying DNA from a biological sample (e.g., from blood or a blood product), non-limiting examples of which include methods of DNA preparation (e.g., described by Sambrook and Russell, Molecular Cloning: A Laboratory Manual 3d ed., 2001), various commercially available reagents or kits, such as DNeasy®, RNeasy®, QIAprep®, QIAquick®, and QIAamp® (e.g., QIAamp® Circulating Nucleic Acid Kit, QIAamp® DNA Mini Kit or QIAamp® DNA Blood Mini Kit) nucleic acid isolation/purification kits by Qiagen, Inc. (Germantown, Md.); GenomicPrep™ Blood DNA Isolation Kit (Promega, Madison, Wis.); GFX™ Genomic Blood DNA Purification Kit (Amersham, Piscataway, N.J.); DNAzol®, ChargeSwitch®, Purelink®, GeneCatcher® nucleic acid isolation/purification kits by Life Technologies, Inc. (Carlsbad, CA); NucleoMag®, NucleoSpin®, and NucleoBond® nucleic acid isolation/purification kits by Clontech Laboratories, Inc. (Mountain View, CA); the like or combinations thereof. In certain aspects, the nucleic acid is isolated from a fixed biological sample, e.g., formalin-fixed, paraffin-embedded (FFPE) tissue. Genomic DNA from FFPE tissue may be isolated using commercially available kits, such as the AllPrep® DNA/RNA FFPE kit by Qiagen, Inc. (Germantown, Md.), the RecoverAll® Total Nucleic Acid Isolation kit for FFPE by Life Technologies, Inc. (Carlsbad, CA), and the NucleoSpin® FFPE kits by Clontech Laboratories, Inc. (Mountain View, CA).

In some embodiments, nucleic acid is extracted from cells using a cell lysis procedure. Cell lysis procedures and reagents are known in the art and may generally be performed by chemical (e.g., detergent, hypotonic solutions, enzymatic procedures, and the like, or combination thereof), physical (e.g., French press, sonication, and the like), or electrolytic lysis methods. Any suitable lysis procedure can be utilized. For example, chemical methods generally employ lysing agents to disrupt cells and extract the nucleic acids from the cells, followed by treatment with chaotropic salts. Physical methods such as freeze/thaw followed by grinding, the use of cell presses and the like also are useful. In some instances, a high salt and/or an alkaline lysis procedure may be utilized. In some instances, a lysis procedure may include a lysis step with EDTA/Proteinase K, a binding buffer step with a high amount of salts (e.g., guanidinium chloride (GuHCl), sodium acetate) and isopropanol, and binding DNA in this solution to a silica-based column. In some instances, a lysis protocol includes certain procedures described in Dabney et al., Proceedings of the National Academy of Sciences 110, no. 39 (2013): 15758-15763.

Nucleic acids can include extracellular nucleic acid in certain embodiments. The term “extracellular nucleic acid” as used herein can refer to nucleic acid isolated from a source having substantially no cells and also is referred to as “cell-free” nucleic acid (cell-free DNA, cell-free RNA, or both), “circulating cell-free nucleic acid” (e.g., CCF fragments, ccf DNA) and/or “cell-free circulating nucleic acid.” Extracellular nucleic acid can be present in and obtained from blood (e.g., from the blood of a human subject). Extracellular nucleic acid often includes no detectable cells and may contain cellular elements or cellular remnants. Non-limiting examples of acellular sources for extracellular nucleic acid are blood, blood plasma, blood serum and urine. In certain aspects, cell-free nucleic acid is obtained from a body fluid sample chosen from whole blood, blood plasma, blood serum, amniotic fluid, saliva, urine, pleural effusion, bronchial lavage, bronchial aspirates, breast milk, colostrum, tears, seminal fluid, peritoneal fluid, and stool. As used herein, the term “obtain cell-free circulating sample nucleic acid” includes obtaining a sample directly (e.g., collecting a sample, e.g., a test sample) or obtaining a sample from another who has collected a sample. Extracellular nucleic acid may be a product of cellular secretion and/or nucleic acid release (e.g., DNA release). Extracellular nucleic acid may be a product of any form of cell death, for example. In some instances, extracellular nucleic acid is a product of any form of type I or type II cell death, including mitotic, oncotic, toxic, ischemic, and the like and combinations thereof. Without being limited by theory, extracellular nucleic acid may be a product of cell apoptosis and cell breakdown, which provides basis for extracellular nucleic acid often having a series of lengths across a spectrum (e.g., a “ladder”). In some instances, extracellular nucleic acid is a product of cell necrosis, necroptosis, oncosis, entosis, pyroptosis, and the like and combinations thereof. In some embodiments, sample nucleic acid from a test subject is circulating cell-free nucleic acid. In some embodiments, circulating cell-free nucleic acid is from blood plasma or blood serum from a test subject. In some aspects, cell-free nucleic acid is degraded. In some embodiments, cell-free nucleic acid comprises cell-free fetal nucleic acid (e.g., cell-free fetal DNA). In certain aspects, cell-free nucleic acid comprises circulating cancer nucleic acid (e.g., cancer DNA). In certain aspects, cell-free nucleic acid comprises circulating tumor nucleic acid (e.g., tumor DNA). In some embodiments, cell-free nucleic acid comprises infectious agent nucleic acid (e.g., pathogen DNA). In some embodiments, cell-free nucleic acid comprises nucleic acid (e.g., DNA) from a transplant. In some embodiments, cell-free nucleic acid comprises nucleic acid (e.g., DNA) from a microbiome (e.g., microbiome of gut, microbiome of blood, microbiome of mouth, microbiome of spinal fluid, microbiome of feces).

Extracellular nucleic acid can include different nucleic acid species, and therefore is referred to herein as “heterogeneous” in certain embodiments. For example, blood serum or plasma from a person having a tumor or cancer can include nucleic acid from tumor cells or cancer cells (e.g., neoplasia) and nucleic acid from non-tumor cells or non-cancer cells. In another example, blood serum or plasma from a pregnant female can include maternal nucleic acid and fetal nucleic acid. In another example, blood serum or plasma from a patient having an infection or infectious disease can include host nucleic acid and infectious agent or pathogen nucleic acid. In another example, a sample from a subject having received a transplant can include host nucleic acid and nucleic acid from the donor organ or tissue. In some instances, cancer nucleic acid, tumor nucleic acid, fetal nucleic acid, pathogen nucleic acid, or transplant nucleic acid sometimes is about 5% to about 50% of the overall nucleic acid (e.g., about 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, or 49% of the total nucleic acid is cancer, tumor, fetal, pathogen, transplant, or microbiome nucleic acid). In another example, heterogeneous nucleic acid may include nucleic acid from two or more subjects (e.g., a sample from a crime scene).

At least two different nucleic acid species can exist in different amounts in extracellular nucleic acid and sometimes are referred to as minority species and majority species. In certain instances, a minority species of nucleic acid is from an affected cell type (e.g., cancer cell, wasting cell, cell attacked by immune system). In certain embodiments, a genetic variation or genetic alteration (e.g., copy number alteration, copy number variation, single nucleotide alteration, single nucleotide variation, chromosome alteration, and/or translocation) is determined for a minority nucleic acid species. In certain embodiments, a genetic variation or genetic alteration is determined for a majority nucleic acid species. Generally, it is not intended that the terms “minority” or “majority” be rigidly defined in any respect. In one aspect, a nucleic acid that is considered “minority,” for example, can have an abundance of at least about 0.1% of the total nucleic acid in a sample to less than 50% of the total nucleic acid in a sample. In some embodiments, a minority nucleic acid can have an abundance of at least about 1% of the total nucleic acid in a sample to about 40% of the total nucleic acid in a sample. In some embodiments, a minority nucleic acid can have an abundance of at least about 2% of the total nucleic acid in a sample to about 30% of the total nucleic acid in a sample. In some embodiments, a minority nucleic acid can have an abundance of at least about 3% of the total nucleic acid in a sample to about 25% of the total nucleic acid in a sample. For example, a minority nucleic acid can have an abundance of about 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 13%, 14%, 15%, 16%, 17%, 18%, 19%, 20%, 21%, 22%, 23%, 24%, 25%, 26%, 27%, 28%, 29% or 30% of the total nucleic acid in a sample. In some instances, a minority species of extracellular nucleic acid sometimes is about 1% to about 40% of the overall nucleic acid (e.g., about 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 13%, 14%, 15%, 16%, 17%, 18%, 19%, 20%, 21%, 22%, 23%, 24%, 25%, 26%, 27%, 28%, 29%, 30%, 31%, 32%, 33%, 34%, 35%, 36%, 37%, 38%, 39% or 40% of the nucleic acid is minority species nucleic acid). In some embodiments, the minority nucleic acid is extracellular DNA. In some embodiments, the minority nucleic acid is extracellular DNA from apoptotic tissue. In some embodiments, the minority nucleic acid is extracellular DNA from tissue where some cells therein underwent apoptosis. In some embodiments, the minority nucleic acid is extracellular DNA from necrotic tissue. In some embodiments, the minority nucleic acid is extracellular DNA from tissue where some cells therein underwent necrosis. Necrosis may refer to a post-mortem process following cell death, in certain instances. In some embodiments, the minority nucleic acid is extracellular DNA from tissue affected by a cell proliferative disorder (e.g., cancer). In some embodiments, the minority nucleic acid is extracellular DNA from a tumor cell. In some embodiments, the minority nucleic acid is extracellular fetal DNA. In some embodiments, the minority nucleic acid is extracellular DNA from a pathogen. In some embodiments, the minority nucleic acid is extracellular DNA from a transplant. In some embodiments, the minority nucleic acid is extracellular DNA from a microbiome.

In another aspect, a nucleic acid that is considered “majority,” for example, can have an abundance greater than 50% of the total nucleic acid in a sample to about 99.9% of the total nucleic acid in a sample. In some embodiments, a majority nucleic acid can have an abundance of at least about 60% of the total nucleic acid in a sample to about 99% of the total nucleic acid in a sample. In some embodiments, a majority nucleic acid can have an abundance of at least about 70% of the total nucleic acid in a sample to about 98% of the total nucleic acid in a sample. In some embodiments, a majority nucleic acid can have an abundance of at least about 75% of the total nucleic acid in a sample to about 97% of the total nucleic acid in a sample. For example, a majority nucleic acid can have an abundance of at least about 70%, 71%, 72%, 73%, 74%, 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% of the total nucleic acid in a sample. In some embodiments, the majority nucleic acid is extracellular DNA. In some embodiments, the majority nucleic acid is extracellular maternal DNA. In some embodiments, the majority nucleic acid is DNA from healthy tissue. In some embodiments, the majority nucleic acid is DNA from non-tumor cells. In some embodiments, the majority nucleic acid is DNA from host cells.
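The abundance ranges for minority and majority species can be illustrated with a small Python sketch. The terms are not rigidly defined in this description, so the 50% cutoff used here is only one of the example boundaries given above, and the function name `label_species` and the input amounts are hypothetical.

```python
def label_species(amounts: dict) -> dict:
    """Map each species name to (percent of total, 'minority' or 'majority').

    Assumption for illustration: below 50% of total is labeled minority and
    above 50% is labeled majority; exactly 50% is labeled 'neither'.
    """
    total = sum(amounts.values())
    if total <= 0:
        raise ValueError("total nucleic acid amount must be positive")
    labels = {}
    for name, amount in amounts.items():
        percent = 100.0 * amount / total
        if percent > 50.0:
            label = "majority"
        elif percent < 50.0:
            label = "minority"
        else:
            label = "neither"
        labels[name] = (percent, label)
    return labels


# Hypothetical amounts in arbitrary units, e.g., a maternal plasma sample.
print(label_species({"maternal": 90, "fetal": 10}))
# {'maternal': (90.0, 'majority'), 'fetal': (10.0, 'minority')}
```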

In some embodiments, a minority species of extracellular nucleic acid is of a length of about 500 base pairs or less (e.g., about 80, 85, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99 or 100% of minority species nucleic acid is of a length of about 500 base pairs or less). In some embodiments, a minority species of extracellular nucleic acid is of a length of about 300 base pairs or less (e.g., about 80, 85, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99 or 100% of minority species nucleic acid is of a length of about 300 base pairs or less). In some embodiments, a minority species of extracellular nucleic acid is of a length of about 250 base pairs or less (e.g., about 80, 85, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99 or 100% of minority species nucleic acid is of a length of about 250 base pairs or less). In some embodiments, a minority species of extracellular nucleic acid is of a length of about 200 base pairs or less (e.g., about 80, 85, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99 or 100% of minority species nucleic acid is of a length of about 200 base pairs or less). In some embodiments, a minority species of extracellular nucleic acid is of a length of about 150 base pairs or less (e.g., about 80, 85, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99 or 100% of minority species nucleic acid is of a length of about 150 base pairs or less). In some embodiments, a minority species of extracellular nucleic acid is of a length of about 100 base pairs or less (e.g., about 80, 85, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99 or 100% of minority species nucleic acid is of a length of about 100 base pairs or less). In some embodiments, a minority species of extracellular nucleic acid is of a length of about 50 base pairs or less (e.g., about 80, 85, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99 or 100% of minority species nucleic acid is of a length of about 50 base pairs or less).
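The length criteria above amount to asking what percentage of fragments falls at or below a threshold. A minimal Python sketch follows; the function name `percent_at_or_below` and the example fragment lengths are hypothetical.

```python
def percent_at_or_below(fragment_lengths, max_length):
    """Return the percentage of fragments whose length is <= max_length (in bp)."""
    if not fragment_lengths:
        raise ValueError("at least one fragment length is required")
    short = sum(1 for length in fragment_lengths if length <= max_length)
    return 100.0 * short / len(fragment_lengths)


# Hypothetical fragment lengths (base pairs) for a minority species.
lengths = [120, 150, 166, 180, 320]
print(percent_at_or_below(lengths, 200))  # 80.0
print(percent_at_or_below(lengths, 500))  # 100.0
```

In this example, 80% of the fragments are 200 base pairs or less, and all are 500 base pairs or less.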

Nucleic acid may be provided for conducting methods described herein with or without processing of the sample(s) containing the nucleic acid. In some embodiments, nucleic acid is provided for conducting methods described herein after processing of the sample(s) containing the nucleic acid. For example, a nucleic acid can be extracted, isolated, purified, partially purified or amplified from the sample(s). The term “isolated” as used herein refers to nucleic acid removed from its original environment (e.g., the natural environment if it is naturally occurring, or a host cell if expressed exogenously), and thus is altered by human intervention (e.g., “by the hand of man”) from its original environment. The term “isolated nucleic acid” as used herein can refer to a nucleic acid removed from a subject (e.g., a human subject). An isolated nucleic acid can be provided with fewer non-nucleic acid components (e.g., protein, lipid) than the amount of components present in a source sample. A composition comprising isolated nucleic acid can be about 50% to greater than 99% free of non-nucleic acid components. A composition comprising isolated nucleic acid can be about 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99% or greater than 99% free of non-nucleic acid components. The term “purified” as used herein can refer to a nucleic acid provided that contains fewer non-nucleic acid components (e.g., protein, lipid, carbohydrate) than the amount of non-nucleic acid components present prior to subjecting the nucleic acid to a purification procedure. A composition comprising purified nucleic acid may be about 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99% or greater than 99% free of other non-nucleic acid components. The term “purified” as used herein can refer to a nucleic acid provided that contains fewer nucleic acid species than in the sample source from which the nucleic acid is derived. A composition comprising purified nucleic acid may be about 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99% or greater than 99% free of other nucleic acid species. For example, fetal nucleic acid can be purified from a mixture comprising maternal and fetal nucleic acid. In certain examples, small fragments of nucleic acid (e.g., 30 to 500 bp fragments) can be purified, or partially purified, from a mixture comprising nucleic acid fragments of different lengths. In certain examples, nucleosomes comprising smaller fragments of nucleic acid can be purified from a mixture of larger nucleosome complexes comprising larger fragments of nucleic acid. In certain examples, larger nucleosome complexes comprising larger fragments of nucleic acid can be purified from nucleosomes comprising smaller fragments of nucleic acid. In certain examples, small fragments of fetal nucleic acid (e.g., 30 to 500 bp fragments) can be purified, or partially purified, from a mixture comprising both fetal and maternal nucleic acid fragments. In certain examples, nucleosomes comprising smaller fragments of fetal nucleic acid can be purified from a mixture of larger nucleosome complexes comprising larger fragments of maternal nucleic acid. In certain examples, cancer cell nucleic acid can be purified from a mixture comprising cancer cell and non-cancer cell nucleic acid. In certain examples, nucleosomes comprising small fragments of cancer cell nucleic acid can be purified from a mixture of larger nucleosome complexes comprising larger fragments of non-cancer nucleic acid. In some embodiments, nucleic acid is provided for conducting methods described herein without prior processing of the sample(s) containing the nucleic acid. For example, nucleic acid may be analyzed directly from a sample without prior extraction, purification, partial purification, and/or amplification.

Nucleic acids may be amplified under amplification conditions. The term “amplified” or “amplification” or “amplification conditions” as used herein refers to subjecting a target nucleic acid in a sample to a process that linearly or exponentially generates amplicon nucleic acids having the same or substantially the same nucleotide sequence as the target nucleic acid, or part thereof. In certain embodiments, the term “amplified” or “amplification” or “amplification conditions” refers to a method that comprises a polymerase chain reaction (PCR). In certain instances, an amplified product can contain one or more nucleotides more than the amplified nucleotide region of a nucleic acid template sequence (e.g., a primer can contain “extra” nucleotides such as a transcriptional initiation sequence, in addition to nucleotides complementary to a nucleic acid template gene molecule, resulting in an amplified product containing “extra” nucleotides or nucleotides not corresponding to the amplified nucleotide region of the nucleic acid template gene molecule).

Nucleic acid also may be exposed to a process that modifies certain nucleotides in the nucleic acid before providing nucleic acid for a method described herein. A process that selectively modifies nucleic acid based upon the methylation state of nucleotides therein can be applied to nucleic acid, for example. In addition, conditions such as high temperature, ultraviolet radiation, and x-radiation can induce changes in the sequence of a nucleic acid molecule. Nucleic acid may be provided in any suitable form useful for conducting a sequence analysis.

In some embodiments, target nucleic acids are not modified in length prior to combining with the oligonucleotides herein. In this context, “not modified” means that target nucleic acids are isolated from a sample and then combined with oligonucleotides without modifying the length of the target nucleic acids. For example, target nucleic acids are not shortened (e.g., they are not contacted with a restriction enzyme or nuclease or physical condition that reduces length (e.g., shearing condition, cleavage condition)) and are not increased in length by one or more nucleotides (e.g., ends are not filled in at overhangs; no nucleotides are added to the ends). Adding a phosphate or chemically reactive group to one or both ends of a target nucleic acid generally is not considered modifying the length of the nucleic acid.

In some embodiments, native ends of target nucleic acids are not modified in length prior to combining with the oligonucleotides herein. In this context, “not modified” means that target nucleic acids are isolated from a sample and then combined with oligonucleotides without modifying the length of the native ends of target nucleic acids. For example, target nucleic acids are not shortened (e.g., they are not contacted with a restriction enzyme or nuclease or physical condition that reduces length (e.g., shearing condition, cleavage condition) to generate non-native ends) and are not increased in length by one or more nucleotides (e.g., native ends are not filled in at overhangs; no nucleotides are added to the native ends). Adding a phosphate or chemically reactive group to one or both native ends of a target nucleic acid generally is not considered modifying the length of the nucleic acid.

In some embodiments, target nucleic acids are not contacted with a cleavage agent (e.g., endonuclease, exonuclease, restriction enzyme) and/or a polymerase prior to combining with the oligonucleotides herein. In some embodiments, target nucleic acids are not subjected to mechanical shearing (e.g., ultrasonication (e.g., Adaptive Focused Acoustics™ (AFA) process by Covaris)) prior to combining with the oligonucleotides herein. In some embodiments, target nucleic acids are not contacted with an exonuclease (e.g., DNAse) prior to combining with the oligonucleotides herein. In some embodiments, target nucleic acids are not amplified prior to combining with the oligonucleotides herein. In some embodiments, target nucleic acids are not attached to a solid support prior to combining with the oligonucleotides herein. In some embodiments, target nucleic acids are not conjugated to another molecule prior to combining with the oligonucleotides herein. In some embodiments, target nucleic acids are not cloned into a vector prior to combining with the oligonucleotides herein. In some embodiments, target nucleic acids may be subjected to dephosphorylation prior to combining with the oligonucleotides herein. In some embodiments, target nucleic acids may be subjected to phosphorylation prior to combining with the oligonucleotides herein.

In some embodiments, combining target nucleic acids with the oligonucleotides herein comprises isolating the target nucleic acids, and combining the isolated target nucleic acids with the oligonucleotides herein. In some embodiments, combining target nucleic acids with the oligonucleotides herein comprises isolating the target nucleic acids, phosphorylating the isolated target nucleic acids, and combining the phosphorylated target nucleic acids with the oligonucleotides herein. In some embodiments, combining target nucleic acids with the oligonucleotides herein comprises isolating the target nucleic acids, dephosphorylating the oligonucleotides, and combining the isolated target nucleic acids with the dephosphorylated oligonucleotides herein. In some embodiments, combining target nucleic acids with the oligonucleotides herein comprises isolating the target nucleic acids, dephosphorylating the isolated target nucleic acids, phosphorylating the dephosphorylated target nucleic acids, and combining the phosphorylated target nucleic acids with the oligonucleotides herein. In some embodiments, combining target nucleic acids with the oligonucleotides herein comprises isolating the target nucleic acids, dephosphorylating the isolated target nucleic acids, phosphorylating the dephosphorylated target nucleic acids, dephosphorylating the oligonucleotides, and combining the phosphorylated target nucleic acids with the dephosphorylated oligonucleotides herein.

In some embodiments, combining target nucleic acids with the oligonucleotides herein consists of isolating the target nucleic acids, and combining the isolated target nucleic acids with the oligonucleotides herein. In some embodiments, combining target nucleic acids with the oligonucleotides herein consists of isolating the target nucleic acids, phosphorylating the isolated target nucleic acids, and combining the phosphorylated target nucleic acids with the oligonucleotides herein. In some embodiments, combining target nucleic acids with the oligonucleotides herein consists of isolating the target nucleic acids, dephosphorylating the oligonucleotides, and combining the isolated target nucleic acids with the dephosphorylated oligonucleotides herein. In some embodiments, combining target nucleic acids with the oligonucleotides herein consists of isolating the target nucleic acids, dephosphorylating the isolated target nucleic acids, phosphorylating the dephosphorylated target nucleic acids, and combining the phosphorylated target nucleic acids with the oligonucleotides herein. In some embodiments, combining target nucleic acids with the oligonucleotides herein consists of isolating the target nucleic acids, dephosphorylating the isolated target nucleic acids, phosphorylating the dephosphorylated target nucleic acids, dephosphorylating the oligonucleotides, and combining the phosphorylated target nucleic acids with the dephosphorylated oligonucleotides herein.

Enriching Nucleic Acids

In some embodiments, nucleic acid (e.g., extracellular nucleic acid) is enriched or relatively enriched for a subpopulation or species of nucleic acid. Nucleic acid subpopulations can include, for example, fetal nucleic acid, maternal nucleic acid, cancer nucleic acid, tumor nucleic acid, patient nucleic acid, host nucleic acid, pathogen nucleic acid, transplant nucleic acid, microbiome nucleic acid, nucleic acid comprising fragments of a particular length or range of lengths, or nucleic acid from a particular genome region (e.g., single chromosome, set of chromosomes, and/or certain chromosome regions). Such enriched samples can be used in conjunction with a method provided herein. Thus, in certain embodiments, methods of the technology comprise an additional step of enriching for a subpopulation of nucleic acid in a sample. In certain embodiments, nucleic acid from normal tissue (e.g., non-cancer cells, host cells) is selectively removed (partially, substantially, almost completely or completely) from the sample. In certain embodiments, maternal nucleic acid is selectively removed (partially, substantially, almost completely or completely) from the sample. In certain embodiments, enriching for a particular low copy number species nucleic acid (e.g., cancer, tumor, fetal, pathogen, transplant, microbiome nucleic acid) may improve quantitative sensitivity. Methods for enriching a sample for a particular species of nucleic acid are described, for example, in U.S. Pat. No. 6,927,028, International Patent Application Publication No. WO2007/140417, International Patent Application Publication No. WO2007/147063, International Patent Application Publication No. WO2009/032779, International Patent Application Publication No. WO2009/032781, International Patent Application Publication No. WO2010/033639, International Patent Application Publication No. WO2011/034631, International Patent Application Publication No. WO2006/056480, and International Patent Application Publication No. WO2011/143659, the entire content of each is incorporated herein by reference, including all text, tables, equations and drawings.

In some embodiments, nucleic acid is enriched for certain target fragment species and/or reference fragment species. In certain embodiments, nucleic acid is enriched for a specific nucleic acid fragment length or range of fragment lengths using one or more length-based separation methods described below. In certain embodiments, nucleic acid is enriched for fragments from a select genomic region (e.g., chromosome) using one or more sequence-based separation methods described herein and/or known in the art.

Non-limiting examples of methods for enriching for a nucleic acid subpopulation in a sample include methods that exploit epigenetic differences between nucleic acid species (e.g., methylation-based fetal nucleic acid enrichment methods described in U.S. Patent Application Publication No. 2010/0105049, which is incorporated by reference herein); restriction endonuclease enhanced polymorphic sequence approaches (e.g., such as a method described in U.S. Patent Application Publication No. 2009/0317818, which is incorporated by reference herein); selective enzymatic degradation approaches; massively parallel signature sequencing (MPSS) approaches; amplification (e.g., PCR)-based approaches (e.g., loci-specific amplification methods, multiplex SNP allele PCR approaches; universal amplification methods); pull-down approaches (e.g., biotinylated ultramer pull-down methods); extension and ligation-based methods (e.g., molecular inversion probe (MIP) extension and ligation); and combinations thereof.

In some embodiments, nucleic acid is enriched for fragments from a select genomic region (e.g., chromosome) using one or more sequence-based separation methods described herein. Sequence-based separation generally is based on nucleotide sequences present in the fragments of interest (e.g., target and/or reference fragments) and substantially not present in other fragments of the sample or present in an insubstantial amount of the other fragments (e.g., 5% or less). In some embodiments, sequence-based separation can generate separated target fragments and/or separated reference fragments. Separated target fragments and/or separated reference fragments often are isolated away from the remaining fragments in the nucleic acid sample. In certain embodiments, the separated target fragments and the separated reference fragments also are isolated away from each other (e.g., isolated in separate assay compartments). In certain embodiments, the separated target fragments and the separated reference fragments are isolated together (e.g., isolated in the same assay compartment). In some embodiments, unbound fragments can be differentially removed or degraded or digested.

In some embodiments, a selective nucleic acid capture process is used to separate target and/or reference fragments away from a nucleic acid sample. Commercially available nucleic acid capture systems include, for example, Nimblegen sequence capture system (Roche NimbleGen, Madison, WI); Illumina BEADARRAY platform (Illumina, San Diego, CA); Affymetrix GENECHIP platform (Affymetrix, Santa Clara, CA); Agilent SureSelect Target Enrichment System (Agilent Technologies, Santa Clara, CA); and related platforms. Such methods typically involve hybridization of a capture oligonucleotide to a part or all of the nucleotide sequence of a target or reference fragment and can include use of a solid phase (e.g., solid phase array) and/or a solution based platform. Capture oligonucleotides (sometimes referred to as “bait”) can be selected or designed such that they preferentially hybridize to nucleic acid fragments from selected genomic regions or loci, or a particular sequence in a nucleic acid target. In certain embodiments, a hybridization-based method (e.g., using oligonucleotide arrays) can be used to enrich for fragments containing certain nucleic acid sequences. Thus, in some embodiments, a nucleic acid sample is optionally enriched by capturing a subset of fragments using capture oligonucleotides complementary to, for example, selected sequences in sample nucleic acid. In certain instances, captured fragments are amplified. For example, captured fragments containing adapters may be amplified using primers complementary to the adapter oligonucleotides to form collections of amplified fragments, indexed according to adapter sequence. In some embodiments, nucleic acid is enriched for fragments from a select genomic region (e.g., chromosome, a gene) by amplification of one or more regions of interest using oligonucleotides (e.g., PCR primers) complementary to sequences in fragments containing the region(s) of interest, or part(s) thereof.
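The idea of capture oligonucleotides selecting a subset of fragments can be sketched in Python as an in silico analogy. This is a simplification under stated assumptions: each fragment is treated as a single strand written 5′ to 3′, and a fragment counts as captured only if it contains a site exactly complementary to a bait, whereas actual hybridization tolerates some mismatches. The function names `reverse_complement` and `capture` are hypothetical.

```python
# Translation table mapping each DNA base to its Watson-Crick partner.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return sequence.upper().translate(_COMPLEMENT)[::-1]


def capture(fragments, baits):
    """Return fragments containing a site fully complementary to any bait."""
    sites = [reverse_complement(bait) for bait in baits]
    return [f for f in fragments if any(site in f.upper() for site in sites)]


# The bait AAGC is complementary to the site GCTT in the first fragment only.
print(capture(["TTGCTTAA", "CCCCCC"], ["AAGC"]))  # ['TTGCTTAA']
```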

In some embodiments, nucleic acid is enriched for a particular nucleic acid fragment length, range of lengths, or lengths under or over a particular threshold or cutoff using one or more length-based separation methods. Nucleic acid fragment length typically refers to the number of nucleotides in the fragment. Nucleic acid fragment length also is sometimes referred to as nucleic acid fragment size. In some embodiments, a length-based separation method is performed without measuring lengths of individual fragments. In some embodiments, a length-based separation method is performed in conjunction with a method for determining length of individual fragments. In some embodiments, length-based separation refers to a size fractionation procedure where all or part of the fractionated pool can be isolated (e.g., retained) and/or analyzed. Size fractionation procedures are known in the art (e.g., separation on an array, separation by a molecular sieve, separation by gel electrophoresis, separation by column chromatography (e.g., size-exclusion columns), and microfluidics-based approaches). In certain instances, length-based separation approaches can include selective sequence tagging approaches, fragment circularization, chemical treatment (e.g., formaldehyde, polyethylene glycol (PEG) precipitation), mass spectrometry and/or size-specific nucleic acid amplification, for example.

In some aspects, a method comprises enriching for a species of target nucleic acid. For example, a method herein may comprise enriching for a species of target nucleic acid having a particular overhang feature (e.g., length, type (5′, 3′), sequence). Enrichment for a species of target nucleic acid having a particular overhang feature may be achieved according to a particular overhang identification sequence. For example, certain target nucleic acids complexed with oligonucleotides described herein may be separated from the rest of the target nucleic acids according to a particular overhang identification sequence (e.g., according to the sequence, or according to another feature (e.g., modification) of the overhang identification sequence). In some embodiments, a method comprises associating complexes (target nucleic acids joined to oligonucleotides herein) with one or more binding agents that specifically hybridize to a particular overhang identification sequence, thereby generating enriched complexes. For the term “specifically hybridize,” specific, or specificity, generally refers to the binding or hybridization of one molecule to another molecule (e.g., a polynucleotide strand to a complementary strand). That is, specific or specificity refers to the recognition, contact, and formation of a stable complex between two molecules, as compared to substantially less recognition, contact, or complex formation of either of those two molecules with other molecules. The term hybridize generally refers to the formation of a stable complex between two molecules.

In some aspects, a polynucleotide complementary to a particular overhang identification sequence comprises a member of a binding pair. In some aspects, one or more nucleotides (e.g., one or more modified nucleotides) in a particular overhang identification sequence comprise a member of a binding pair. Binding pairs may include, for example, antibody/antigen, antibody/antibody, antibody/antibody fragment, antibody/antibody receptor, antibody/protein A or protein G, hapten/anti-hapten, biotin/avidin, biotin/streptavidin, folic acid/folate binding protein, vitamin B12/intrinsic factor, chemical reactive group/complementary chemical reactive group, digoxigenin moiety/anti-digoxigenin antibody, fluorescein moiety/anti-fluorescein antibody, steroid/steroid-binding protein, operator/repressor, nuclease/nucleotide, lectin/polysaccharide, active compound/active compound receptor, hormone/hormone receptor, enzyme/substrate, oligonucleotide or polynucleotide/its corresponding complement, the like or combinations thereof.

In some embodiments, one or more binding agents that specifically hybridize to a particular overhang identification sequence may be attached to a solid support (e.g., bead or any suitable solid support described herein or known in the art). Enrichment for target nucleic acids having a particular species of overhang may be subsequently achieved according to any suitable method for separating biomolecules (e.g., pull down assays, use of solid supports, and the like).

Length-Based Separation

In some embodiments, a method herein comprises separating target nucleic acids according to fragment length. For example, target nucleic acids may be enriched for a particular nucleic acid fragment length, range of lengths, or lengths under or over a particular threshold or cutoff using one or more length-based separation methods. Nucleic acid fragment length typically refers to the number of nucleotides in the fragment. Nucleic acid fragment length also may be referred to as nucleic acid fragment size. In some embodiments, a length-based separation method is performed without measuring lengths of individual fragments. In some embodiments, a length-based separation method is performed in conjunction with a method for determining length of individual fragments. In some embodiments, length-based separation refers to a size fractionation procedure where all or part of the fractionated pool can be isolated (e.g., retained) and/or analyzed. Size fractionation procedures are known in the art (e.g., separation on an array, separation by a molecular sieve, separation by gel electrophoresis, separation by column chromatography (e.g., size-exclusion columns), and microfluidics-based approaches). In some embodiments, length-based separation approaches can include fragment circularization, chemical treatment (e.g., formaldehyde, polyethylene glycol (PEG)), mass spectrometry and/or size-specific nucleic acid amplification, for example. In some embodiments, length-based separation is performed using Solid Phase Reversible Immobilization (SPRI) beads.

In some embodiments, nucleic acid fragments of a certain length, range of lengths, or lengths under or over a particular threshold or cutoff are separated from the sample. In some embodiments, fragments having a length under a particular threshold or cutoff (e.g., 500 bp, 400 bp, 300 bp, 200 bp, 150 bp, 100 bp) are referred to as “short” fragments and fragments having a length over a particular threshold or cutoff (e.g., 500 bp, 600 bp, 700 bp, 800 bp, 900 bp, 1000 bp) are referred to as “long” fragments, large fragments, and/or high molecular weight (HMW) fragments. In some embodiments, fragments of a certain length, range of lengths, or lengths under or over a particular threshold or cutoff are retained for analysis while fragments of a different length or range of lengths, or lengths over or under the threshold or cutoff are not retained for analysis. In some embodiments, fragments that are less than about 500 bp are retained. In some embodiments, fragments that are less than about 400 bp are retained. In some embodiments, fragments that are less than about 300 bp are retained. In some embodiments, fragments that are less than about 200 bp are retained. In some embodiments, fragments that are less than about 150 bp are retained. For example, fragments that are less than about 190 bp, 180 bp, 170 bp, 160 bp, 150 bp, 140 bp, 130 bp, 120 bp, 110 bp or 100 bp are retained. In some embodiments, fragments that are about 100 bp to about 200 bp are retained. For example, fragments that are about 190 bp, 180 bp, 170 bp, 160 bp, 150 bp, 140 bp, 130 bp, 120 bp or 110 bp are retained. In some embodiments, fragments that are in the range of about 100 bp to about 200 bp are retained. For example, fragments that are in the range of about 110 bp to about 190 bp, 130 bp to about 180 bp, 140 bp to about 170 bp, 140 bp to about 150 bp, 150 bp to about 160 bp, or 145 bp to about 155 bp are retained.
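As an illustration only, the retention criteria above can be mirrored by a simple in silico filter, for example one applied to fragment lengths inferred after sequencing. The Python sketch below is not part of the described separation methods, which are physical; the function name retain_by_length and the example fragments are hypothetical, and the thresholds are treated as inclusive bounds for simplicity.

```python
from typing import Iterable, List, Optional


def retain_by_length(fragments: Iterable[str],
                     min_bp: Optional[int] = None,
                     max_bp: Optional[int] = None) -> List[str]:
    """Keep fragments whose length lies within [min_bp, max_bp], inclusive.

    A bound left as None is not applied, so the same helper covers
    "under a cutoff", "over a cutoff", and "within a range".
    """
    kept = []
    for frag in fragments:
        n = len(frag)
        if min_bp is not None and n < min_bp:
            continue  # shorter than the lower cutoff
        if max_bp is not None and n > max_bp:
            continue  # longer than the upper cutoff
        kept.append(frag)
    return kept


# Hypothetical fragments of 90, 150, 167 and 350 nucleotides.
fragments = ["A" * 90, "C" * 150, "G" * 167, "T" * 350]

short_only = retain_by_length(fragments, max_bp=199)           # under a 200 bp cutoff
window = retain_by_length(fragments, min_bp=100, max_bp=200)   # 100 bp to 200 bp range

print([len(f) for f in short_only])  # [90, 150, 167]
print([len(f) for f in window])      # [150, 167]
```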

In some embodiments, target nucleic acids having fragment lengths of less than about 1000 bp are combined with a plurality or pool of oligonucleotide species described herein. In some embodiments, target nucleic acids having fragment lengths of less than about 500 bp are combined with a plurality or pool of oligonucleotide species described herein. In some embodiments, target nucleic acids having fragment lengths of less than about 400 bp are combined with a plurality or pool of oligonucleotide species described herein. In some embodiments, target nucleic acids having fragment lengths of less than about 300 bp are combined with a plurality or pool of oligonucleotide species described herein. In some embodiments, target nucleic acids having fragment lengths of less than about 200 bp are combined with a plurality or pool of oligonucleotide species described herein. In some embodiments, target nucleic acids having fragment lengths of less than about 100 bp are combined with a plurality or pool of oligonucleotide species described herein.

In some embodiments, target nucleic acids having fragment lengths of about 100 bp or more are combined with a plurality or pool of oligonucleotide species described herein. In some embodiments, target nucleic acids having fragment lengths of about 200 bp or more are combined with a plurality or pool of oligonucleotide species described herein. In some embodiments, target nucleic acids having fragment lengths of about 300 bp or more are combined with a plurality or pool of oligonucleotide species described herein. In some embodiments, target nucleic acids having fragment lengths of about 400 bp or more are combined with a plurality or pool of oligonucleotide species described herein. In some embodiments, target nucleic acids having fragment lengths of about 500 bp or more are combined with a plurality or pool of oligonucleotide species described herein. In some embodiments, target nucleic acids having fragment lengths of about 1000 bp or more are combined with a plurality or pool of oligonucleotide species described herein.

In some embodiments, target nucleic acids having any fragment length or any combination of fragment lengths are combined with a plurality or pool of oligonucleotide species described herein. For example, target nucleic acids having fragment lengths of less than 500 bp and fragment lengths of 500 bp or more may be combined with a plurality or pool of oligonucleotide species described herein.

Certain length-based separation methods that can be used with methods described herein employ a selective sequence tagging approach, for example. In such methods, nucleic acids of a fragment size species (e.g., short fragments) are selectively tagged in a sample that includes long and short nucleic acids. Such methods typically involve performing a nucleic acid amplification reaction using a set of nested primers which include inner primers and outer primers. In some embodiments, one or both of the inner primers can be tagged to thereby introduce a tag onto the target amplification product. The outer primers generally do not anneal to the short fragments that carry the (inner) target sequence. The inner primers can anneal to the short fragments and generate an amplification product that carries a tag and the target sequence. Typically, tagging of the long fragments is inhibited through a combination of mechanisms which include, for example, blocked extension of the inner primers by the prior annealing and extension of the outer primers. Enrichment for tagged fragments can be accomplished by any of a variety of methods, including for example, exonuclease digestion of single stranded nucleic acid and amplification of the tagged fragments using amplification primers specific for at least one tag.

Another length-based separation method that can be used with methods described herein involves subjecting a nucleic acid sample to polyethylene glycol (PEG) precipitation. Examples of methods include those described in International Patent Application Publication Nos. WO2007/140417 and WO2010/115016. This method in general entails contacting a nucleic acid sample with PEG in the presence of one or more monovalent salts under conditions sufficient to substantially precipitate large nucleic acids without substantially precipitating small (e.g., less than 300 nucleotides) nucleic acids.

Another length-based enrichment method that can be used with methods described herein involves circularization by ligation, for example, using circligase. Short nucleic acid fragments typically can be circularized with higher efficiency than long fragments. Non-circularized sequences can be separated from circularized sequences, and the enriched short fragments can be used for further analysis.

Nucleic Acid Library

Methods herein may include preparing a nucleic acid library and/or modifying nucleic acids for a nucleic acid library. In some embodiments, ends of nucleic acid fragments are modified such that the fragments, or amplified products thereof, may be incorporated into a nucleic acid library. Generally, a nucleic acid library refers to a plurality of polynucleotide molecules (e.g., a sample of nucleic acids) that are prepared, assembled and/or modified for a specific process, non-limiting examples of which include immobilization on a solid phase (e.g., a solid support, a flow cell, a bead), enrichment, amplification, cloning, detection and/or for nucleic acid sequencing. In certain embodiments, a nucleic acid library is prepared prior to or during a sequencing process. A nucleic acid library (e.g., sequencing library) can be prepared by a suitable method as known in the art. A nucleic acid library can be prepared by a targeted or a non-targeted preparation process.

In some embodiments, a library of nucleic acids is modified to comprise a chemical moiety (e.g., a functional group) configured for immobilization of nucleic acids to a solid support. In some embodiments, a library of nucleic acids is modified to comprise a biomolecule (e.g., a functional group) and/or member of a binding pair configured for immobilization of the library to a solid support, non-limiting examples of which include thyroxin-binding globulin, steroid-binding proteins, antibodies, antigens, haptens, enzymes, lectins, nucleic acids, repressors, protein A, protein G, avidin, streptavidin, biotin, complement component C1q, nucleic acid-binding proteins, receptors, carbohydrates, oligonucleotides, polynucleotides, complementary nucleic acid sequences, the like and combinations thereof. Some examples of specific binding pairs include, without limitation: an avidin moiety and a biotin moiety; an antigenic epitope and an antibody or immunologically reactive fragment thereof; an antibody and a hapten; a digoxigenin moiety and an anti-digoxigenin antibody; a fluorescein moiety and an anti-fluorescein antibody; an operator and a repressor; a nuclease and a nucleotide; a lectin and a polysaccharide; a steroid and a steroid-binding protein; an active compound and an active compound receptor; a hormone and a hormone receptor; an enzyme and a substrate; an immunoglobulin and protein A; an oligonucleotide or polynucleotide and its corresponding complement; the like or combinations thereof.

In some embodiments, a library of nucleic acids is modified to comprise one or more polynucleotides of known composition, non-limiting examples of which include an identifier (e.g., a tag, an indexing tag), a capture sequence, a label, an adapter, a restriction enzyme site, a promoter, an enhancer, an origin of replication, a stem loop, a complementary sequence (e.g., a primer binding site, an annealing site), a suitable integration site (e.g., a transposon, a viral integration site), a modified nucleotide, an overhang identification sequence (i.e., unique end identifier (UEI)) described herein, a unique molecular identifier (UMI) described herein, a palindromic sequence described herein, the like or combinations thereof. Polynucleotides of known sequence can be added at a suitable position, for example on the 5′ end, 3′ end or within a nucleic acid sequence. Polynucleotides of known sequence can be the same or different sequences. In some embodiments, a polynucleotide of known sequence is configured to hybridize to one or more oligonucleotides immobilized on a surface (e.g., a surface in a flow cell). For example, a nucleic acid molecule comprising a 5′ known sequence may hybridize to a first plurality of oligonucleotides while the 3′ known sequence may hybridize to a second plurality of oligonucleotides. In some embodiments, a library of nucleic acids can comprise chromosome-specific tags, capture sequences, labels and/or adapters (e.g., oligonucleotide adapters described herein). In some embodiments, a library of nucleic acids comprises one or more detectable labels. In some embodiments, one or more detectable labels may be incorporated into a nucleic acid library at a 5′ end, at a 3′ end, and/or at any nucleotide position within a nucleic acid in the library. In some embodiments, a library of nucleic acids comprises hybridized oligonucleotides. In certain embodiments, hybridized oligonucleotides are labeled probes. In some embodiments, a library of nucleic acids comprises hybridized oligonucleotide probes prior to immobilization on a solid phase.

In some embodiments, a polynucleotide of known sequence comprises a universal sequence. A universal sequence is a specific nucleotide sequence that is integrated into two or more nucleic acid molecules or two or more subsets of nucleic acid molecules where the universal sequence is the same for all molecules or subsets of molecules that it is integrated into. A universal sequence is often designed to hybridize to and/or amplify a plurality of different sequences using a single universal primer that is complementary to a universal sequence. In some embodiments, two (e.g., a pair) or more universal sequences and/or universal primers are used. A universal primer often comprises a universal sequence. In some embodiments, adapters (e.g., universal adapters) comprise universal sequences. In some embodiments, one or more universal sequences are used to capture, identify and/or detect multiple species or subsets of nucleic acids.

In certain embodiments of preparing a nucleic acid library (e.g., in certain sequencing by synthesis procedures), nucleic acids are size selected and/or fragmented into lengths of several hundred base pairs, or less (e.g., in preparation for library generation). In some embodiments, library preparation is performed without fragmentation (e.g., when using cell-free DNA).

In certain embodiments, a ligation-based library preparation method is used (e.g., ILLUMINA TRUSEQ, Illumina, San Diego CA). Ligation-based library preparation methods often make use of an adapter (e.g., a methylated adapter) design which can incorporate an index sequence (e.g., a sample index sequence to identify sample origin for a nucleic acid sequence) at the initial ligation step and often can be used to prepare samples for single-read sequencing, paired-end sequencing and multiplexed sequencing. For example, nucleic acids (e.g., fragmented nucleic acids or cell-free DNA) may be end repaired by a fill-in reaction, an exonuclease reaction or a combination thereof. In some embodiments, the resulting blunt-end repaired nucleic acid can then be extended by a single nucleotide, which is complementary to a single nucleotide overhang on the 3′ end of an adapter/primer. Any nucleotide can be used for the extension/overhang nucleotides. In some embodiments, end repair is omitted and adapter oligonucleotides (e.g., oligonucleotides described herein) are ligated directly to the native ends of nucleic acids (e.g., fragmented nucleic acids or cell-free DNA).

In some embodiments, nucleic acid library preparation comprises ligating an adapter oligonucleotide (e.g., to a sample nucleic acid, to a sample nucleic acid fragment, to a template nucleic acid, to a target nucleic acid), such as an adapter oligonucleotide described herein. Adapter oligonucleotides are often complementary to flow-cell anchors, and sometimes are utilized to immobilize a nucleic acid library to a solid support, such as the inside surface of a flow cell, for example. In some embodiments, an adapter oligonucleotide comprises an identifier, one or more sequencing primer hybridization sites (e.g., sequences complementary to universal sequencing primers, single end sequencing primers, paired end sequencing primers, multiplexed sequencing primers, and the like), or combinations thereof (e.g., adapter/sequencing, adapter/identifier, adapter/identifier/sequencing). In some embodiments, an adapter oligonucleotide comprises one or more of a primer annealing polynucleotide, also referred to herein as priming sequence or primer binding domain (e.g., for annealing to flow cell attached oligonucleotides and/or to free amplification primers), an index polynucleotide (e.g., sample index sequence for tracking nucleic acid from different samples; also referred to as a sample ID), an overhang identification sequence (also referred to herein as a unique end identifier (UEI)), and a barcode polynucleotide (e.g., single molecule barcode (SMB) for tracking individual molecules of sample nucleic acid that are amplified prior to sequencing; also referred to as a molecular barcode or a unique molecular identifier (UMI)). In some embodiments, a primer annealing component (or priming sequence or primer binding domain) of an adapter oligonucleotide comprises one or more universal sequences (e.g., sequences complementary to one or more universal amplification primers). In some embodiments, an index polynucleotide (e.g., sample index; sample ID) is a component of an adapter oligonucleotide. In some embodiments, an index polynucleotide (e.g., sample index; sample ID) is a component of a universal amplification primer sequence.
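To illustrate how a sample index can track nucleic acid from different samples, the following Python sketch bins reads by an index sequence. It is a minimal sketch under loud assumptions: the index sequences, sample names, and the read layout (index occupying the first bases of each read) are all hypothetical, the function demultiplex is not a real library API, and only exact index matches are considered.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def demultiplex(reads: Iterable[str],
                index_to_sample: Dict[str, str],
                index_length: int) -> Dict[str, List[str]]:
    """Group reads by the sample index found at the start of each read.

    Assumes (hypothetically) that the index is the first `index_length`
    bases; reads with an unrecognized index go to "undetermined".
    """
    bins: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        index, insert = read[:index_length], read[index_length:]
        sample = index_to_sample.get(index, "undetermined")
        bins[sample].append(insert)  # store the read with its index removed
    return dict(bins)


# Hypothetical six-nucleotide sample indexes.
index_to_sample = {"ACGTAC": "sample_A", "TGCATG": "sample_B"}
reads = ["ACGTACGGGTTT", "TGCATGAAACCC", "ACGTACTTTGGG", "NNNNNNACGTAC"]

bins = demultiplex(reads, index_to_sample, index_length=6)
print(bins["sample_A"])      # ['GGGTTT', 'TTTGGG']
print(bins["sample_B"])      # ['AAACCC']
print(bins["undetermined"])  # ['ACGTAC']
```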

In some embodiments, adapter oligonucleotides when used in combination with amplification primers (e.g., universal amplification primers) are designed to generate library constructs comprising one or more of: universal sequences, molecular barcodes, sample ID sequences, spacer sequences, and a sample nucleic acid sequence. In some embodiments, adapter oligonucleotides when used in combination with universal amplification primers are designed to generate library constructs comprising an ordered combination of one or more of: universal sequences, molecular barcodes, sample ID sequences, spacer sequences, and a sample nucleic acid sequence. For example, a library construct may comprise a first universal sequence, followed by a second universal sequence, followed by a first molecular barcode, followed by a spacer sequence, followed by a template sequence (e.g., sample nucleic acid sequence), followed by a spacer sequence, followed by a second molecular barcode, followed by a third universal sequence, followed by a sample ID, followed by a fourth universal sequence. In some embodiments, adapter oligonucleotides when used in combination with amplification primers (e.g., universal amplification primers) are designed to generate library constructs for each strand of a template molecule (e.g., sample nucleic acid molecule). In some embodiments, adapter oligonucleotides are duplex adapter oligonucleotides.
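The ordered example construct described above can be represented as a list of named elements. The Python sketch below is illustrative only: the element order follows the example in the text, but every sequence is a made-up placeholder (not a real adapter, barcode, or index), and the helper names assemble and element_spans are hypothetical.

```python
from typing import Dict, List, Tuple

# Ordered elements of the example construct. All sequences are
# placeholders chosen for illustration only.
CONSTRUCT_ELEMENTS: List[Tuple[str, str]] = [
    ("universal_1", "AAAACCCC"),
    ("universal_2", "GGGGTTTT"),
    ("molecular_barcode_1", "ACGTAC"),
    ("spacer_1", "TT"),
    ("template", "ATGCCGTTAGCAGGCT"),
    ("spacer_2", "AA"),
    ("molecular_barcode_2", "TGCATG"),
    ("universal_3", "CCCCAAAA"),
    ("sample_id", "GATTACA"),
    ("universal_4", "TTTTGGGG"),
]


def assemble(elements: List[Tuple[str, str]]) -> str:
    """Concatenate element sequences in the listed order."""
    return "".join(seq for _, seq in elements)


def element_spans(elements: List[Tuple[str, str]]) -> Dict[str, Tuple[int, int]]:
    """Return the half-open (start, end) offsets of each named element."""
    spans = {}
    pos = 0
    for name, seq in elements:
        spans[name] = (pos, pos + len(seq))
        pos += len(seq)
    return spans


construct = assemble(CONSTRUCT_ELEMENTS)
spans = element_spans(CONSTRUCT_ELEMENTS)

start, end = spans["template"]
print(spans["template"])      # (24, 40)
print(construct[start:end])   # ATGCCGTTAGCAGGCT
```

Keeping the element order in one list makes it straightforward to recover any element (for example, the sample ID or a molecular barcode) from a construct of known layout by its offsets.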

An identifier can be a suitable detectable label incorporated into or attached to a nucleic acid (e.g., a polynucleotide) that allows detection and/or identification of nucleic acids that comprise the identifier. In some embodiments, an identifier is incorporated into or attached to a nucleic acid during a sequencing method (e.g., by a polymerase). Non-limiting examples of identifiers include nucleic acid tags, nucleic acid indexes or barcodes, a radiolabel (e.g., an isotope), metallic label, a fluorescent label, a chemiluminescent label, a phosphorescent label, a fluorophore quencher, a dye, a protein (e.g., an enzyme, an antibody or part thereof, a linker, a member of a binding pair), the like or combinations thereof. In some embodiments, an identifier (e.g., a nucleic acid index or barcode) is a unique, known and/or identifiable sequence of nucleotides or nucleotide analogues. In some embodiments, identifiers are six or more contiguous nucleotides. A multitude of fluorophores are available with a variety of different excitation and emission spectra. Any suitable type and/or number of fluorophores can be used as an identifier. In some embodiments, 1 or more, 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 20 or more, 30 or more or 50 or more different identifiers are utilized in a method described herein (e.g., a nucleic acid detection and/or sequencing method). In some embodiments, one or two types of identifiers (e.g., fluorescent labels) are linked to each nucleic acid in a library. Detection and/or quantification of an identifier can be performed by a suitable method, apparatus or machine, non-limiting examples of which include flow cytometry, quantitative polymerase chain reaction (qPCR), gel electrophoresis, a luminometer, a fluorometer, a spectrophotometer, a suitable gene-chip or microarray analysis, Western blot, mass spectrometry, chromatography, cytofluorimetric analysis, fluorescence microscopy, a suitable fluorescence or digital imaging method, confocal laser scanning microscopy, laser scanning cytometry, affinity chromatography, manual batch mode separation, electric field suspension, a suitable nucleic acid sequencing method and/or nucleic acid sequencing apparatus, the like and combinations thereof.

In some embodiments, a nucleic acid library or parts thereof are amplified (e.g., amplified by a PCR-based method) under amplification conditions. In some embodiments, a sequencing method comprises amplification of a nucleic acid library. A nucleic acid library can be amplified prior to or after immobilization on a solid support (e.g., a solid support in a flow cell). Nucleic acid amplification includes the process of amplifying or increasing the numbers of a nucleic acid template and/or of a complement thereof that are present (e.g., in a nucleic acid library), by producing one or more copies of the template and/or its complement. Amplification can be carried out by a suitable method. A nucleic acid library can be amplified by a thermocycling method or by an isothermal amplification method. In some embodiments, a rolling circle amplification method is used. In some embodiments, amplification takes place on a solid support (e.g., within a flow cell) where a nucleic acid library or portion thereof is immobilized. In certain sequencing methods, a nucleic acid library is added to a flow cell and immobilized by hybridization to anchors under suitable conditions. This type of nucleic acid amplification is often referred to as solid phase amplification. In some embodiments of solid phase amplification, all or a portion of the amplified products are synthesized by an extension initiating from an immobilized primer. Solid phase amplification reactions are analogous to standard solution phase amplifications except that at least one of the amplification oligonucleotides (e.g., primers) is immobilized on a solid support. In some embodiments, modified nucleic acid (e.g., nucleic acid modified by addition of adapters) is amplified.

In some embodiments, solid phase amplification comprises a nucleic acid amplification reaction comprising only one species of oligonucleotide primer immobilized to a surface. In certain embodiments, solid phase amplification comprises a plurality of different immobilized oligonucleotide primer species. In some embodiments, solid phase amplification may comprise a nucleic acid amplification reaction comprising one species of oligonucleotide primer immobilized on a solid surface and a second different oligonucleotide primer species in solution. Multiple different species of immobilized or solution based primers can be used. Non-limiting examples of solid phase nucleic acid amplification reactions include interfacial amplification, bridge amplification, emulsion PCR, WildFire amplification (e.g., U.S. Patent Application Publication No. 2013/0012399), the like or combinations thereof.

Nucleic Acid Sequencing

In some embodiments, nucleic acid (e.g., nucleic acid fragments, sample nucleic acid, cell-free nucleic acid) is sequenced. In some embodiments, nucleic acid targets hybridized to oligonucleotides provided herein (“hybridization products”) are sequenced by a sequencing process. In some embodiments, hybridization products are amplified by an amplification process, and the amplification products are sequenced by a sequencing process. In some embodiments, the sequencing process generates sequence reads (or sequencing reads). In some embodiments, a method herein comprises determining the sequence of an overhang for target nucleic acids based on the sequence reads. In some embodiments, a method herein comprises determining a sequence of an overhang identification sequence or unique end identifier (UEI) based on the sequence reads. In some embodiments, a method herein comprises determining the sequence of elements comprising an overhang identification sequence or unique end identifier (UEI) and an overhang for target nucleic acids based on the sequence reads. In some embodiments, a method herein comprises determining the sequence of elements consisting of an overhang identification sequence or unique end identifier (UEI) and an overhang for target nucleic acids based on the sequence reads. In some embodiments, a method herein comprises determining lengths of the overhangs for target nucleic acids according to the sequence reads.
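One way to picture determining overhang features from sequence reads is a lookup keyed on the overhang identification sequence. The Python sketch below is a hedged illustration, not the method described herein: the UEI sequences, the table contents, the six-nucleotide UEI length, and the assumption that the UEI occupies the first bases of a read are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical table mapping an overhang identification sequence (UEI)
# to the overhang type and length it is assumed to indicate.
UEI_TABLE: Dict[str, Tuple[str, int]] = {
    "ACGTAC": ("blunt", 0),
    "TGCATG": ("5'", 2),
    "GATCGA": ("3'", 3),
}
UEI_LENGTH = 6  # assumed UEI length, for illustration only


def call_overhang(read: str) -> Optional[Tuple[str, int]]:
    """Return (overhang type, overhang length) for a read, or None.

    Assumes (hypothetically) that the UEI is the first UEI_LENGTH bases
    of the read; reads with an unrecognized UEI yield None.
    """
    return UEI_TABLE.get(read[:UEI_LENGTH])


reads: List[str] = ["TGCATGAACCGGTT", "ACGTACGGCATTAG", "NNNNNNACGT"]
calls = [call_overhang(r) for r in reads]
print(calls)  # [("5'", 2), ('blunt', 0), None]
```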

For certain sequencing platforms (e.g., paired-end sequencing), generating sequence reads may include generating forward sequence reads and generating reverse sequence reads. For example, certain paired-end sequencing platforms sequence each nucleic acid fragment from both directions, generally resulting in two reads per nucleic acid fragment, with the first read in a forward orientation (forward read) and the second read in reverse-complement orientation (reverse read). For certain platforms, a forward read is generated off a particular primer within a sequencing adapter (e.g., Illumina adapter, P5 primer), and a reverse read is generated off a different primer within a sequencing adapter (e.g., Illumina adapter, P7 primer).
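The relationship between a forward read and a reverse read of the same fragment can be sketched as follows. This Python example is a simplified model under stated assumptions: the fragment sequence and read length are hypothetical, adapters and sequencing errors are ignored, and the helper names are illustrative rather than any platform's API.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def paired_end_reads(fragment: str, read_length: int):
    """Model the two reads obtained from one fragment.

    The forward read starts at one end of the fragment; the reverse read
    starts at the other end and is in reverse-complement orientation.
    """
    forward = fragment[:read_length]
    reverse = reverse_complement(fragment)[:read_length]
    return forward, reverse


fragment = "ATGCCGTTAGCA"  # hypothetical 12-nucleotide fragment
fwd, rev = paired_end_reads(fragment, read_length=5)
print(fwd)  # ATGCC
print(rev)  # TGCTA
```

In this model the reverse read reports the far end of the fragment, which is why each read of a pair carries information about a different fragment end.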

In some embodiments, a method herein comprises analyzing (e.g., quantifying, processing) a subset of sequence reads. In some embodiments, a method herein comprises analyzing (e.g., quantifying, processing) a subset of sequence reads and omitting another subset of sequence reads from the analysis. In some embodiments, a method herein comprises analyzing or processing overhang information for a subset of sequence reads. In some embodiments, a method herein comprises analyzing (e.g., quantifying, processing) reverse sequence reads. In some embodiments, a method herein comprises analyzing or processing overhang information for reverse sequence reads. In some embodiments, a method herein comprises analyzing or processing overhang information associated with overhang identification sequences for reverse sequence reads. In some embodiments, a method herein comprises analyzing (e.g., quantifying, processing) P7 sequence reads. In some embodiments, a method herein comprises analyzing (e.g., quantifying, processing) overhang information generated from P7 sequence reads. In some embodiments, a method herein comprises analyzing (e.g., quantifying, processing) overhang information associated with overhang identification sequences generated from P7 sequence reads.

In some embodiments, a method herein comprises omitting forward sequence reads from an analysis. In some embodiments, a method herein comprises omitting overhang information generated from forward sequence reads from an analysis. In some embodiments, a method herein comprises omitting overhang information associated with overhang identification sequences generated from forward sequence reads from an analysis. In some embodiments, a method herein comprises omitting P5 sequence reads from an analysis. In some embodiments, a method herein comprises omitting overhang information generated from P5 sequence reads from an analysis. In some embodiments, a method herein comprises omitting overhang information associated with overhang identification sequences generated from P5 sequence reads from an analysis.

In some embodiments, forward reads are not excluded entirely. For example, the overhang identification sequence of a forward read may be ignored, so that the overhang inferred from the forward read overhang identification sequence is excluded from an overhang analysis and only overhangs from the reverse reads are analyzed. In such instances, other aspects of the forward reads may be included in an analysis, for example, to infer fragment length, determine GC content, identify single nucleotide variants, or identify blunt ends.

In some embodiments, a method herein comprises analyzing or processing overhang information associated with overhang identification sequences that indicate no overhang (i.e., blunt end) for reverse sequence reads. In some embodiments, a method herein comprises analyzing or processing overhang information associated with overhang identification sequences that indicate no overhang (i.e., blunt end) for forward sequence reads. In some embodiments, a method herein comprises analyzing or processing overhang information associated with overhang identification sequences that indicate no overhang (i.e., blunt end) for forward and reverse sequence reads. Thus, in some embodiments, where an overhang identification sequence indicates no overhang (i.e., blunt end), no information about the blunt end is omitted from the analysis.

In some embodiments, a method herein comprises analyzing or processing overhang information associated with overhang identification sequences that indicate no overhang (i.e., blunt end) for forward and reverse sequence reads, analyzing or processing overhang information associated with overhang identification sequences that indicate presence of an overhang for reverse sequence reads, and omitting overhang information associated with overhang identification sequences that indicate presence of an overhang for forward sequence reads from the analysis. Thus, an analysis of nucleic acid ends (e.g., native nucleic acid ends) may include analysis of blunt end information generated from both forward and reverse sequence reads, and overhang information generated from reverse reads only.
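
The selection rule described above (blunt end information kept from both read directions, overhang information kept from reverse reads only) can be illustrated with a minimal Python sketch. The record type `EndCall`, its field names, and the convention that an overhang length of 0 denotes a blunt end are assumptions made for illustration and are not defined by the methods herein.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class EndCall:
    """Hypothetical record of the end type inferred from one read's
    overhang identification sequence."""
    read_id: str
    direction: str        # "forward" or "reverse"
    overhang_length: int  # 0 is assumed here to mean a blunt end


def select_end_calls(calls: Iterable[EndCall]) -> List[EndCall]:
    """Keep blunt-end calls from both directions and overhang calls
    from reverse reads only; drop forward-read overhang calls."""
    kept = []
    for call in calls:
        is_blunt = call.overhang_length == 0
        if is_blunt or call.direction == "reverse":
            kept.append(call)
    return kept


example_calls = [
    EndCall("r1", "forward", 0),  # blunt, forward: kept
    EndCall("r2", "forward", 3),  # overhang, forward: omitted
    EndCall("r3", "reverse", 0),  # blunt, reverse: kept
    EndCall("r4", "reverse", 2),  # overhang, reverse: kept
]
kept_ids = [call.read_id for call in select_end_calls(example_calls)]
```

In this sketch only the overhang call is dropped for a forward read; the read itself remains available for other uses such as fragment length or GC content.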

Nucleic acid may be sequenced using any suitable sequencing platform including a Sanger sequencing platform, a high throughput or massively parallel sequencing (next generation sequencing (NGS)) platform, or the like, such as, for example, a sequencing platform provided by Illumina® (e.g., HiSeq™, MiSeq™ and/or Genome Analyzer™ sequencing systems); Oxford Nanopore™ Technologies (e.g., MinION sequencing system); Ion Torrent™ (e.g., Ion PGM™ and/or Ion Proton™ sequencing systems); Pacific Biosciences (e.g., PACBIO RS II sequencing system); Life Technologies™ (e.g., SOLiD sequencing system); Roche (e.g., 454 GS FLX+ and/or GS Junior sequencing systems); or any other suitable sequencing platform. In some embodiments, the sequencing process is a highly multiplexed sequencing process. In certain instances, a full or substantially full sequence is obtained and sometimes a partial sequence is obtained. Nucleic acid sequencing generally produces a collection of sequence reads. As used herein, “reads” (e.g., “a read,” “a sequence read”) are short sequences of nucleotides produced by any sequencing process described herein or known in the art. Reads can be generated from one end of nucleic acid fragments (single-end reads), and sometimes are generated from both ends of nucleic acid fragments (e.g., paired-end reads, double-end reads). In some embodiments, a sequencing process generates short sequencing reads or “short reads.” In some embodiments, the nominal, average, mean or absolute length of short reads sometimes is about 10 contiguous nucleotides to about 250 or more contiguous nucleotides. In some embodiments, the nominal, average, mean or absolute length of short reads sometimes is about 50 contiguous nucleotides to about 150 or more contiguous nucleotides.

The length of a sequence read is often associated with the particular sequencing technology utilized. High-throughput methods, for example, provide sequence reads that can vary in size from tens to hundreds of base pairs (bp). Nanopore sequencing, for example, can provide sequence reads that can vary in size from tens to hundreds to thousands of base pairs. In some embodiments, sequence reads are of a mean, median, average or absolute length of about 15 bp to about 900 bp long. In certain embodiments sequence reads are of a mean, median, average or absolute length of about 1000 bp or more. In some embodiments sequence reads are of a mean, median, average or absolute length of about 1500, 2000, 2500, 3000, 3500, 4000, 4500, or 5000 bp or more. In some embodiments, sequence reads are of a mean, median, average or absolute length of about 100 bp to about 200 bp.

In some embodiments, the nominal, average, mean or absolute length of single-end reads sometimes is about 10 contiguous nucleotides to about 250 or more contiguous nucleotides, about 15 contiguous nucleotides to about 200 or more contiguous nucleotides, about 15 contiguous nucleotides to about 150 or more contiguous nucleotides, about 15 contiguous nucleotides to about 125 or more contiguous nucleotides, about 15 contiguous nucleotides to about 100 or more contiguous nucleotides, about 15 contiguous nucleotides to about 75 or more contiguous nucleotides, about 15 contiguous nucleotides to about 60 or more contiguous nucleotides, about 15 contiguous nucleotides to about 50 or more contiguous nucleotides, about 15 contiguous nucleotides to about 40 or more contiguous nucleotides, and sometimes about 15 contiguous nucleotides or about 36 or more contiguous nucleotides. In certain embodiments the nominal, average, mean or absolute length of single-end reads is about 20 to about 30 bases, or about 24 to about 28 bases in length. In certain embodiments the nominal, average, mean or absolute length of single-end reads is about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 21, 22, 23, 24, 25, 26, 27, 28 or about 29 bases or more in length. In certain embodiments the nominal, average, mean or absolute length of single-end reads is about 20 to about 200 bases, about 100 to about 200 bases, or about 140 to about 160 bases in length. In certain embodiments the nominal, average, mean or absolute length of single-end reads is about 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or about 200 bases or more in length.

In certain embodiments, the nominal, average, mean or absolute length of paired-end reads sometimes is about 10 contiguous nucleotides to about 25 contiguous nucleotides or more (e.g., about 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 or 25 nucleotides in length or more), about 15 contiguous nucleotides to about 20 contiguous nucleotides or more, and sometimes is about 17 contiguous nucleotides or about 18 contiguous nucleotides. In certain embodiments, the nominal, average, mean or absolute length of paired-end reads sometimes is about 25 contiguous nucleotides to about 400 contiguous nucleotides or more (e.g., about 25, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, or 400 nucleotides in length or more), about 50 contiguous nucleotides to about 350 contiguous nucleotides or more, about 100 contiguous nucleotides to about 325 contiguous nucleotides, about 150 contiguous nucleotides to about 325 contiguous nucleotides, about 200 contiguous nucleotides to about 325 contiguous nucleotides, about 275 contiguous nucleotides to about 310 contiguous nucleotides, about 100 contiguous nucleotides to about 200 contiguous nucleotides, about 100 contiguous nucleotides to about 175 contiguous nucleotides, about 125 contiguous nucleotides to about 175 contiguous nucleotides, and sometimes is about 140 contiguous nucleotides to about 160 contiguous nucleotides. In certain embodiments, the nominal, average, mean, or absolute length of paired-end reads is about 150 contiguous nucleotides, and sometimes is 150 contiguous nucleotides.

Reads generally are representations of nucleotide sequences in a physical nucleic acid. For example, in a read containing an ATGC depiction of a sequence, “A” represents an adenine nucleotide, “T” represents a thymine nucleotide, “G” represents a guanine nucleotide and “C” represents a cytosine nucleotide, in a physical nucleic acid. Sequence reads obtained from a sample from a subject can be reads from a mixture of a minority nucleic acid and a majority nucleic acid. For example, sequence reads obtained from the blood of a cancer patient can be reads from a mixture of cancer nucleic acid and non-cancer nucleic acid. In another example, sequence reads obtained from the blood of a pregnant female can be reads from a mixture of fetal nucleic acid and maternal nucleic acid. In another example, sequence reads obtained from the blood of a patient having an infection or infectious disease can be reads from a mixture of host nucleic acid and pathogen nucleic acid. In another example, sequence reads obtained from the blood of a transplant recipient can be reads from a mixture of host nucleic acid and transplant nucleic acid. In another example, sequence reads obtained from a sample can be reads from a mixture of nucleic acid from microorganisms collectively comprising a microbiome (e.g., microbiome of gut, microbiome of blood, microbiome of mouth, microbiome of spinal fluid, microbiome of feces) in a subject. In another example, sequence reads obtained from a sample can be reads from a mixture of nucleic acid from microorganisms collectively comprising a microbiome (e.g., microbiome of gut, microbiome of blood, microbiome of mouth, microbiome of spinal fluid, microbiome of feces), and nucleic acid from the host subject. A mixture of relatively short reads can be transformed by processes described herein into a representation of genomic nucleic acid present in the subject, and/or a representation of genomic nucleic acid present in a tumor, a fetus, a pathogen, a transplant, or a microbiome.

In certain embodiments, “obtaining” nucleic acid sequence reads of a sample from a subject and/or “obtaining” nucleic acid sequence reads of a biological specimen from one or more reference persons can involve directly sequencing nucleic acid to obtain the sequence information. In some embodiments, “obtaining” can involve receiving sequence information obtained directly from a nucleic acid by another.

In some embodiments, some or all nucleic acids in a sample are enriched and/or amplified (e.g., non-specifically, e.g., by a PCR based method) prior to or during sequencing. In certain embodiments, specific nucleic acid species or subsets in a sample are enriched and/or amplified prior to or during sequencing. In some embodiments, a species or subset of a pre-selected pool of nucleic acids is sequenced randomly. In some embodiments, nucleic acids in a sample are not enriched and/or amplified prior to or during sequencing.

In some embodiments, a representative fraction of a genome is sequenced and is sometimes referred to as “coverage” or “fold coverage.” For example, a 1-fold coverage indicates that roughly 100% of the nucleotide sequences of the genome are represented by reads. In some instances, fold coverage is referred to as (and is directly proportional to) “sequencing depth.” In some embodiments, “fold coverage” is a relative term referring to a prior sequencing run as a reference. For example, a second sequencing run may have 2-fold less coverage than a first sequencing run. In some embodiments, a genome is sequenced with redundancy, where a given region of the genome can be covered by two or more reads or overlapping reads (e.g., a “fold coverage” greater than 1, e.g., a 2-fold coverage). In some embodiments, a genome (e.g., a whole genome) is sequenced with about 0.01-fold to about 100-fold coverage, about 0.1-fold to 20-fold coverage, or about 0.1-fold to about 1-fold coverage (e.g., about 0.015-, 0.02-, 0.03-, 0.04-, 0.05-, 0.06-, 0.07-, 0.08-, 0.09-, 0.1-, 0.2-, 0.3-, 0.4-, 0.5-, 0.6-, 0.7-, 0.8-, 0.9-, 1-, 2-, 3-, 4-, 5-, 6-, 7-, 8-, 9-, 10-, 15-, 20-, 30-, 40-, 50-, 60-, 70-, 80-, 90-fold or greater coverage). In some embodiments, specific parts of a genome (e.g., genomic parts from targeted methods) are sequenced and fold coverage values generally refer to the fraction of the specific genomic parts sequenced (i.e., fold coverage values do not refer to the whole genome). In some instances, specific genomic parts are sequenced at 1000-fold coverage or more. For example, specific genomic parts may be sequenced at 2000-fold, 5,000-fold, 10,000-fold, 20,000-fold, 30,000-fold, 40,000-fold or 50,000-fold coverage. In some embodiments, sequencing is at about 1,000-fold to about 100,000-fold coverage. In some embodiments, sequencing is at about 10,000-fold to about 70,000-fold coverage. In some embodiments, sequencing is at about 20,000-fold to about 60,000-fold coverage. In some embodiments, sequencing is at about 30,000-fold to about 50,000-fold coverage.
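
As an illustration of fold coverage, the following Python sketch uses a common convention that is assumed here and not stated in the text above: fold coverage is estimated as the number of reads multiplied by the mean read length, divided by the length of the genome or of the targeted genomic parts. The function name and the example numbers are hypothetical.

```python
def fold_coverage(read_count: int, mean_read_length: float,
                  region_length: int) -> float:
    """Estimate fold coverage as total sequenced bases divided by the
    length of the region of interest (whole genome or targeted parts).

    This is an assumed convention used for illustration only.
    """
    if region_length <= 0:
        raise ValueError("region_length must be positive")
    return read_count * mean_read_length / region_length


# Hypothetical whole-genome example: one million 150-nucleotide reads
# against a 3,000,000,000 bp genome (low, sub-1-fold coverage).
whole_genome = fold_coverage(1_000_000, 150, 3_000_000_000)

# Hypothetical targeted example: 2,000 reads of 150 nucleotides
# against a 300 bp target (deep coverage of a specific genomic part).
targeted = fold_coverage(2_000, 150, 300)
```

Under this convention the whole-genome example corresponds to about 0.05-fold coverage and the targeted example to 1000-fold coverage, which shows why targeted values are reported relative to the targeted parts and not to the whole genome.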

In some embodiments, one nucleic acid sample from one individual is sequenced. In certain embodiments, nucleic acids from each of two or more samples are sequenced, where samples are from one individual or from different individuals. In certain embodiments, nucleic acid samples from two or more biological samples are pooled, where each biological sample is from one individual or two or more individuals, and the pool is sequenced. In the latter embodiments, a nucleic acid sample from each biological sample often is identified by one or more unique identifiers.

In some embodiments, a sequencing method utilizes identifiers that allow multiplexing of sequence reactions in a sequencing process. The greater the number of unique identifiers, the greater the number of samples and/or chromosomes for detection, for example, that can be multiplexed in a sequencing process. A sequencing process can be performed using any suitable number of unique identifiers (e.g., 4, 8, 12, 24, 48, 96, or more).

A sequencing process sometimes makes use of a solid phase, and sometimes the solid phase comprises a flow cell on which nucleic acid from a library can be attached and reagents can be flowed and contacted with the attached nucleic acid. A flow cell sometimes includes flow cell lanes, and use of identifiers can facilitate analyzing a number of samples in each lane. A flow cell often is a solid support that can be configured to retain and/or allow the orderly passage of reagent solutions over bound analytes. Flow cells frequently are planar in shape, optically transparent, generally in the millimeter or sub-millimeter scale, and often have channels or lanes in which the analyte/reagent interaction occurs. In some embodiments, the number of samples analyzed in a given flow cell lane is dependent on the number of unique identifiers utilized during library preparation and/or probe design. Multiplexing using 12 identifiers, for example, allows simultaneous analysis of 96 samples (e.g., equal to the number of wells in a 96 well microwell plate) in an 8-lane flow cell. Similarly, multiplexing using 48 identifiers, for example, allows simultaneous analysis of 384 samples (e.g., equal to the number of wells in a 384 well microwell plate) in an 8-lane flow cell. Non-limiting examples of commercially available multiplex sequencing kits include Illumina's multiplexing sample preparation oligonucleotide kit and multiplexing sequencing primers and PhiX control kit (e.g., Illumina's catalog numbers PE-400-1001 and PE-400-1002, respectively).
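
The arithmetic behind the multiplexing examples above (12 identifiers across 8 lanes gives 96 samples; 48 identifiers across 8 lanes gives 384 samples) can be sketched in Python as follows. The function names and the simple lane-by-lane assignment scheme are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def multiplex_capacity(unique_identifiers: int, lanes: int) -> int:
    """Number of samples that can be distinguished when each lane
    carries one sample per unique identifier."""
    if unique_identifiers < 1 or lanes < 1:
        raise ValueError("identifiers and lanes must be at least 1")
    return unique_identifiers * lanes


def assign_samples(sample_names: List[str], unique_identifiers: int,
                   lanes: int) -> Dict[str, Tuple[int, int]]:
    """Assign each sample a (lane, identifier) pair, filling one lane
    at a time. Lane and identifier numbers start at 1."""
    if len(sample_names) > multiplex_capacity(unique_identifiers, lanes):
        raise ValueError("more samples than lane/identifier combinations")
    layout = {}
    for index, name in enumerate(sample_names):
        lane, identifier = divmod(index, unique_identifiers)
        layout[name] = (lane + 1, identifier + 1)
    return layout


# Hypothetical 96-sample plate on an 8-lane flow cell with 12 identifiers.
plate = [f"s{i}" for i in range(96)]
layout = assign_samples(plate, unique_identifiers=12, lanes=8)
```

Each sample receives a distinct lane and identifier combination, so reads can be attributed back to their sample after sequencing.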

Any suitable method of sequencing nucleic acids can be used, non-limiting examples of which include Maxam & Gilbert, chain-termination methods, sequencing by synthesis, sequencing by ligation, sequencing by mass spectrometry, microscopy-based techniques, the like or combinations thereof. In some embodiments, a first-generation technology, such as, for example, Sanger sequencing methods including automated Sanger sequencing methods, including microfluidic Sanger sequencing, can be used in a method provided herein. In some embodiments, sequencing technologies that include the use of nucleic acid imaging technologies (e.g., transmission electron microscopy (TEM) and atomic force microscopy (AFM)), can be used. In some embodiments, a high-throughput sequencing method is used. High-throughput sequencing methods generally involve clonally amplified DNA templates or single DNA molecules that are sequenced in a massively parallel fashion, sometimes within a flow cell. Next generation (e.g., 2nd and 3rd generation) sequencing techniques capable of sequencing DNA in a massively parallel fashion can be used for methods described herein and are collectively referred to herein as “massively parallel sequencing” (MPS). In some embodiments, MPS sequencing methods utilize a targeted approach, where specific chromosomes, genes or regions of interest are sequenced. In certain embodiments, a non-targeted approach is used where most or all nucleic acids in a sample are sequenced, amplified and/or captured randomly.

In some embodiments a targeted enrichment, amplification and/or sequencing approach is used. A targeted approach often isolates, selects and/or enriches a subset of nucleic acids in a sample for further processing by use of sequence-specific oligonucleotides. In some embodiments, a library of sequence-specific oligonucleotides is utilized to target (e.g., hybridize to) one or more sets of nucleic acids in a sample. Sequence-specific oligonucleotides and/or primers are often selective for particular sequences (e.g., unique nucleic acid sequences) present in one or more chromosomes, genes, exons, introns, and/or regulatory regions of interest. Any suitable method or combination of methods can be used for enrichment, amplification and/or sequencing of one or more subsets of targeted nucleic acids. In some embodiments targeted sequences are isolated and/or enriched by capture to a solid phase (e.g., a flow cell, a bead) using one or more sequence-specific anchors. In some embodiments targeted sequences are enriched and/or amplified by a polymerase-based method (e.g., a PCR-based method, by any suitable polymerase based extension) using sequence-specific primers and/or primer sets. Sequence specific anchors often can be used as sequence-specific primers.

MPS sequencing sometimes makes use of sequencing by synthesis and certain imaging processes. A nucleic acid sequencing technology that may be used in a method described herein is sequencing-by-synthesis and reversible terminator-based sequencing (e.g., Illumina's Genome Analyzer; Genome Analyzer II; HISEQ 2000; HISEQ 2500 (Illumina, San Diego, CA)). With this technology, millions of nucleic acid (e.g., DNA) fragments can be sequenced in parallel. In one example of this type of sequencing technology, a flow cell is used which contains an optically transparent slide with 8 individual lanes on the surfaces of which are bound oligonucleotide anchors (e.g., adapter primers).

Sequencing by synthesis generally is performed by iteratively adding (e.g., by covalent addition) a nucleotide to a primer or preexisting nucleic acid strand in a template directed manner. Each iterative addition of a nucleotide is detected and the process is repeated multiple times until a sequence of a nucleic acid strand is obtained. The length of a sequence obtained depends, in part, on the number of addition and detection steps that are performed. In some embodiments of sequencing by synthesis, one, two, three or more nucleotides of the same type (e.g., A, G, C or T) are added and detected in a round of nucleotide addition. Nucleotides can be added by any suitable method (e.g., enzymatically or chemically). For example, in some embodiments a polymerase or a ligase adds a nucleotide to a primer or to a preexisting nucleic acid strand in a template directed manner. In some embodiments of sequencing by synthesis, different types of nucleotides, nucleotide analogues and/or identifiers are used. In some embodiments, reversible terminators and/or removable (e.g., cleavable) identifiers are used. In some embodiments fluorescent labeled nucleotides and/or nucleotide analogues are used. In certain embodiments sequencing by synthesis comprises a cleavage (e.g., cleavage and removal of an identifier) and/or a washing step. In some embodiments the addition of one or more nucleotides is detected by a suitable method described herein or known in the art, non-limiting examples of which include any suitable imaging apparatus, a suitable camera, a digital camera, a CCD (Charge-Coupled Device) based imaging apparatus (e.g., a CCD camera), a CMOS (Complementary Metal-Oxide-Semiconductor) based imaging apparatus (e.g., a CMOS camera), a photo diode (e.g., a photomultiplier tube), electron microscopy, a field-effect transistor (e.g., a DNA field-effect transistor), an ISFET ion sensor (e.g., a CHEMFET sensor), the like or combinations thereof.

Any suitable MPS method, system or technology platform for conducting methods described herein can be used to obtain nucleic acid sequence reads. Non-limiting examples of MPS platforms include Illumina/Solexa/HiSeq (e.g., Illumina's Genome Analyzer; Genome Analyzer II; HISEQ 2000; HISEQ), SOLiD, Roche/454, PACBIO and/or SMRT, Helicos True Single Molecule Sequencing, Ion Torrent and Ion semiconductor-based sequencing (e.g., as developed by Life Technologies), WildFire, 5500, 5500xl W and/or 5500xl W Genetic Analyzer based technologies (e.g., as developed and sold by Life Technologies, U.S. Patent Application Publication No. 2013/0012399); Polony sequencing, Pyrosequencing, Massively Parallel Signature Sequencing (MPSS), RNA polymerase (RNAP) sequencing, LaserGen systems and methods, Nanopore-based platforms, chemical-sensitive field effect transistor (CHEMFET) array, electron microscopy-based sequencing (e.g., as developed by ZS Genetics, Halcyon Molecular), nanoball sequencing, the like or combinations thereof. Other sequencing methods that may be used to conduct methods herein include digital PCR, sequencing by hybridization, nanopore sequencing, chromosome-specific sequencing (e.g., using DANSR (digital analysis of selected regions) technology).

In some embodiments, nucleic acid is sequenced and the sequencing product (e.g., a collection of sequence reads) is processed prior to, or in conjunction with, an analysis of the sequenced nucleic acid. For example, sequence reads may be processed according to one or more of the following: aligning, mapping, filtering, counting, normalizing, weighting, generating a profile, and the like, and combinations thereof. Certain processing steps may be performed in any order and certain processing steps may be repeated.

Mapping Reads

Sequence reads can be mapped, and the number of reads mapping to a specified nucleic acid region (e.g., a chromosome or portion thereof) is referred to as counts. In certain embodiments, sequence reads comprising overhang sequence information can be mapped, and the number of reads comprising overhang sequence information that map to a specified nucleic acid region can be counted. Any suitable mapping method (e.g., process, algorithm, program, software, module, the like or combination thereof) can be used. Certain aspects of mapping processes are described hereafter.

Mapping nucleotide sequence reads (i.e., sequence information from a fragment whose physical genomic position is unknown) can be performed in a number of ways, and often comprises alignment of the obtained sequence reads with a matching sequence in a reference genome. In such alignments, sequence reads generally are aligned to a reference sequence and those that align are designated as being “mapped,” as “a mapped sequence read” or as “a mapped read.” In certain embodiments, a mapped sequence read is referred to as a “hit” or “count.” In some embodiments, mapped sequence reads are grouped together according to various parameters and assigned to particular genomic portions, which are discussed in further detail below.

The terms “aligned,” “alignment,” or “aligning” generally refer to two or more nucleic acid sequences that can be identified as a match (e.g., 100% identity) or partial match. Alignments can be done manually or by a computer (e.g., a software, program, module, or algorithm), non-limiting examples of which include the Efficient Local Alignment of Nucleotide Data (ELAND) computer program distributed as part of the Illumina Genomics Analysis pipeline. Alignment of a sequence read can be a 100% sequence match. In some cases, an alignment is less than a 100% sequence match (i.e., non-perfect match, partial match, partial alignment). In some embodiments an alignment is about a 99%, 98%, 97%, 96%, 95%, 94%, 93%, 92%, 91%, 90%, 89%, 88%, 87%, 86%, 85%, 84%, 83%, 82%, 81%, 80%, 79%, 78%, 77%, 76% or 75% match. In some embodiments, an alignment comprises a mismatch. In some embodiments, an alignment comprises 1, 2, 3, 4 or 5 mismatches. Two or more sequences can be aligned using either strand (e.g., sense or antisense strand). In certain embodiments a nucleic acid sequence is aligned with the reverse complement of another nucleic acid sequence.
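
The notions of mismatch count, percent match, and reverse complement used above can be illustrated with a short Python sketch. It assumes an ungapped, position-by-position comparison of two sequences of equal length, which is a simplification; alignment programs such as those named herein also handle gaps and scoring. All function names are hypothetical.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def count_mismatches(read: str, reference_segment: str) -> int:
    """Count positions at which two equal-length sequences differ
    (ungapped comparison, case-insensitive)."""
    if len(read) != len(reference_segment):
        raise ValueError("sequences must have equal length")
    pairs = zip(read.upper(), reference_segment.upper())
    return sum(1 for a, b in pairs if a != b)


def percent_identity(read: str, reference_segment: str) -> float:
    """Percentage of positions that match in an ungapped comparison."""
    if not read:
        raise ValueError("read must not be empty")
    mismatches = count_mismatches(read, reference_segment)
    return 100.0 * (len(read) - mismatches) / len(read)


# A read with one mismatch over ten positions is a 90% match.
identity = percent_identity("ACGTACGTAC", "ACGTACGTAA")

# A read from the opposite strand matches after reverse complementing.
opposite_strand_hit = count_mismatches(reverse_complement("GCAT"), "ATGC")
```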

Various computational methods can be used to map each sequence read to a portion. Non-limiting examples of computer algorithms that can be used to align sequences include, without limitation, BLAST, BLITZ, FASTA, BOWTIE 1, BOWTIE 2, ELAND, MAQ, PROBEMATCH, SOAP, BWA or SEQMAP, or variations thereof or combinations thereof. In some embodiments, sequence reads can be aligned with sequences in a reference genome. In some embodiments, sequence reads can be found and/or aligned with sequences in nucleic acid databases known in the art including, for example, GenBank, dbEST, dbSTS, EMBL (European Molecular Biology Laboratory) and DDBJ (DNA Databank of Japan). BLAST or similar tools can be used to search identified sequences against a sequence database. Search hits can then be used to sort the identified sequences into appropriate portions (described hereafter), for example.

In some embodiments, a read may uniquely or non-uniquely map to portions in a reference genome. A read is considered as “uniquely mapped” if it aligns with a single sequence in the reference genome. A read is considered as “non-uniquely mapped” if it aligns with two or more sequences in the reference genome. In some embodiments, non-uniquely mapped reads are eliminated from further analysis (e.g., quantification). A certain, small degree of mismatch (0-1) may be allowed to account for single nucleotide polymorphisms that may exist between the reference genome and the reads from individual samples being mapped, in certain embodiments. In some embodiments, no degree of mismatch is allowed for a read mapped to a reference sequence.
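
The distinction between uniquely and non-uniquely mapped reads can be sketched in Python as follows. This is a toy, brute-force scan of a single strand of a short reference string, assumed here purely for illustration; it is not how the alignment programs named herein operate. The function names and the example reference are hypothetical.

```python
from typing import List


def find_hits(read: str, reference: str, max_mismatches: int = 0) -> List[int]:
    """Return every reference start position at which the read aligns
    (ungapped) with at most max_mismatches mismatches."""
    hits = []
    read_length = len(read)
    for start in range(len(reference) - read_length + 1):
        window = reference[start:start + read_length]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


def classify_read(read: str, reference: str, max_mismatches: int = 0) -> str:
    """Label a read as unmapped, uniquely mapped, or non-uniquely mapped."""
    hits = find_hits(read, reference, max_mismatches)
    if not hits:
        return "unmapped"
    return "uniquely mapped" if len(hits) == 1 else "non-uniquely mapped"


# Hypothetical 18-nucleotide reference in which "ACGT" occurs twice.
toy_reference = "ACGTTTGACCACGTAAGG"
reads = ["TTGACC", "ACGT", "GGGGGG"]

# Keep only uniquely mapped reads for quantification.
retained = [r for r in reads
            if classify_read(r, toy_reference) == "uniquely mapped"]
```

Setting `max_mismatches=1` corresponds to the embodiments above in which a small degree of mismatch is tolerated; the default of 0 corresponds to the embodiments in which no mismatch is allowed.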

As used herein, the term “reference genome” can refer to any particular known, sequenced or characterized genome, whether partial or complete, of any organism or virus which may be used to reference identified sequences from a subject. For example, a reference genome used for human subjects as well as many other organisms can be found at the National Center for Biotechnology Information at World Wide Web URL ncbi.nlm.nih.gov. A “genome” refers to the complete genetic information of an organism or virus, expressed in nucleic acid sequences. As used herein, a reference sequence or reference genome often is an assembled or partially assembled genomic sequence from an individual or multiple individuals. In some embodiments, a reference genome is an assembled or partially assembled genomic sequence from one or more human individuals. In some embodiments, a reference genome comprises sequences assigned to chromosomes.

In certain embodiments, mappability is assessed for a genomic region (e.g., portion, genomic portion). Mappability is the ability to unambiguously align a nucleotide sequence read to a portion of a reference genome, typically up to a specified number of mismatches, including, for example, 0, 1, 2 or more mismatches. For a given genomic region, the expected mappability can be estimated using a sliding-window approach of a preset read length and averaging the resulting read-level mappability values. Genomic regions comprising stretches of unique nucleotide sequence sometimes have a high mappability value.
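
The sliding-window estimate described above can be sketched in Python under simplifying assumptions that are not part of the description: only exact matches (0 mismatches) on a single strand are considered, and the read-level mappability of a window is taken to be 1 if its sequence occurs exactly once in the reference and 0 otherwise. The function name and the example reference are hypothetical.

```python
from collections import Counter


def region_mappability(reference: str, region_start: int,
                       region_end: int, read_length: int) -> float:
    """Average read-level mappability over all windows of read_length
    that lie fully inside [region_start, region_end).

    A window scores 1 if its sequence occurs exactly once anywhere in
    the reference (exact match, one strand), otherwise 0.
    """
    # Count every read-length window across the whole reference.
    last_start = len(reference) - read_length
    occurrences = Counter(
        reference[i:i + read_length] for i in range(last_start + 1)
    )
    # Slide the window across the region of interest.
    region_stop = min(region_end, len(reference))
    starts = range(region_start, region_stop - read_length + 1)
    if len(starts) == 0:
        raise ValueError("region is shorter than the read length")
    unique = sum(
        1 for i in starts if occurrences[reference[i:i + read_length]] == 1
    )
    return unique / len(starts)


# Hypothetical reference: a repetitive stretch followed by unique sequence.
toy_reference = "AAAAAAAA" + "CGTTAGCA"
repetitive_score = region_mappability(toy_reference, 0, 8, read_length=4)
unique_score = region_mappability(toy_reference, 8, 16, read_length=4)
```

In this toy example the repetitive stretch scores low and the stretch of unique sequence scores high, consistent with the statement that regions comprising unique nucleotide sequence sometimes have a high mappability value.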

For paired-end sequencing, reads may be mapped to a reference genome by use of a suitable mapping and/or alignment program, non-limiting examples of which include BWA (Li H. and Durbin R. (2009) Bioinformatics 25, 1754-60), Novoalign [Novocraft (2010)], Bowtie (Langmead B, et al., (2009) Genome Biol. 10:R25), SOAP2 (Li R, et al., (2009) Bioinformatics 25, 1966-67), BFAST (Homer N, et al., (2009) PLoS ONE 4, e7767), GASSST (Rizk, G. and Lavenier, D. (2010) Bioinformatics 26, 2534-2540), and MPscan (Rivals E., et al. (2009) Lecture Notes in Computer Science 5724, 246-260), and the like. Paired-end reads may be mapped and/or aligned using a suitable short read alignment program. Non-limiting examples of short read alignment programs include BarraCUDA, BFAST, BLASTN, BLAT, Bowtie, BWA, CASHX, CUDA-EC, CUSHAW, CUSHAW2, drFAST, ELAND, ERNE, GNUMAP, GEM, GensearchNGS, GMAP, Geneious Assembler, iSAAC, LAST, MAQ, mrFAST, mrsFAST, MOSAIK, MPscan, Novoalign, NovoalignCS, Novocraft, NextGENe, Omixon, PALMapper, Partek, PASS, PerM, QPalma, RazerS, REAL, cREAL, RMAP, rNA, RTG, Segemehl, SeqMap, Shrec, SHRiMP, SLIDER, SOAP, SOAP2, SOAP3, SOCS, SSAHA, SSAHA2, Stampy, SToRM, Subread, Subjunc, Taipan, UGENE, VelociMapper, TimeLogic, XpressAlign, ZOOM, the like or combinations thereof. Paired-end reads are often mapped to opposing ends of the same polynucleotide fragment, according to a reference genome. In some embodiments, read mates are mapped independently. In some embodiments, information from both sequence reads (i.e., from each end) is factored in the mapping process. A reference genome is often used to determine and/or infer the sequence of nucleic acids located between paired-end read mates.

The term “discordant read pairs” as used herein refers to a paired-end read comprising a pair of read mates, where one or both read mates fail to unambiguously map to the same region of a reference genome defined, in part, by a segment of contiguous nucleotides. In some embodiments discordant read pairs are paired-end read mates that map to unexpected locations of a reference genome. Non-limiting examples of unexpected locations of a reference genome include (i) two different chromosomes, (ii) locations separated by more than a predetermined fragment size (e.g., more than 300 bp, more than 500 bp, more than 1000 bp, more than 5000 bp, or more than 10,000 bp), (iii) an orientation inconsistent with a reference sequence (e.g., opposite orientations), the like or a combination thereof. In some embodiments discordant read mates are identified according to a length (e.g., an average length, a predetermined fragment size) or expected length of template polynucleotide fragments in a sample. For example, read mates that map to a location that is separated by more than the average length or expected length of polynucleotide fragments in a sample are sometimes identified as discordant read pairs. Read pairs that map in opposite orientation are sometimes determined by taking the reverse complement of one of the reads and comparing the alignment of both reads using the same strand of a reference sequence. Discordant read pairs can be identified by any suitable method and/or algorithm known in the art or described herein (e.g., SVDetect, Lumpy, BreakDancer, BreakDancerMax, CREST, DELLY, the like or combinations thereof).
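
The three example criteria for unexpected locations listed above can be sketched as a simple rule-based check in Python. This is an illustrative sketch and not the algorithm of any of the named tools. The record type `MappedMate`, the default fragment size, and in particular the assumption that concordant mates are expected on opposite strands (exposed as a parameter) are hypothetical choices that depend on the library and the aligner's reporting convention.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MappedMate:
    """Hypothetical mapped position of one read mate."""
    chromosome: str
    position: int  # start coordinate on the reference
    strand: str    # "+" or "-"


def discordance_reasons(mate1: MappedMate, mate2: MappedMate,
                        max_fragment_size: int = 1000,
                        expect_opposite_strands: bool = True) -> List[str]:
    """Return the reasons a read pair looks discordant (empty if none)."""
    if mate1.chromosome != mate2.chromosome:
        # Separation and orientation are not compared across chromosomes.
        return ["different chromosomes"]
    reasons = []
    if abs(mate1.position - mate2.position) > max_fragment_size:
        reasons.append("separation exceeds predetermined fragment size")
    on_opposite_strands = mate1.strand != mate2.strand
    if on_opposite_strands != expect_opposite_strands:
        reasons.append("unexpected orientation")
    return reasons


concordant = discordance_reasons(
    MappedMate("chr1", 100, "+"), MappedMate("chr1", 350, "-"))
split_pair = discordance_reasons(
    MappedMate("chr1", 100, "+"), MappedMate("chr2", 350, "-"))
far_same_strand = discordance_reasons(
    MappedMate("chr1", 100, "+"), MappedMate("chr1", 5000, "+"))
```

A pair is treated as discordant whenever the returned list is non-empty; the `max_fragment_size` threshold plays the role of the predetermined fragment size discussed above.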

Sequence Read Quantification

Sequence reads that are mapped or partitioned based on a selected feature or variable can be quantified to determine the amount or number of reads that are mapped to one or more portions (e.g., portion of a reference genome). In some embodiments, sequence reads comprising overhang information that are mapped or partitioned based on a selected feature or variable can be quantified to determine the amount or number of reads comprising overhang information that are mapped to one or more portions. In certain embodiments, the quantity of sequence reads that are mapped to a portion or segment is referred to as a count or read density.

A count often is associated with a genomic portion. In some embodiments a count is determined from some or all of the sequence reads mapped to (i.e., associated with) a portion. In certain embodiments, a count is determined from some or all of the sequence reads mapped to a group of portions (e.g., portions in a segment or region).

A count can be determined by a suitable method, operation or mathematical process. A count sometimes is the direct sum of all sequence reads mapped to a genomic portion, to a group of genomic portions corresponding to a segment, to a group of portions corresponding to a sub-region of a genome (e.g., copy number variation region, copy number alteration region, copy number duplication region, copy number deletion region, microduplication region, microdeletion region, chromosome region, autosome region, sex chromosome region), and/or to a group of portions corresponding to a genome. A read quantification sometimes is a ratio, and sometimes is a ratio of a quantification for portion(s) in region a to a quantification for portion(s) in region b. Region a sometimes is one portion, segment region, copy number variation region, copy number alteration region, copy number duplication region, copy number deletion region, microduplication region, microdeletion region, chromosome region, autosome region and/or sex chromosome region. Region b independently sometimes is one portion, segment region, copy number variation region, copy number alteration region, copy number duplication region, copy number deletion region, microduplication region, microdeletion region, chromosome region, autosome region, sex chromosome region, a region including all autosomes, a region including sex chromosomes and/or a region including all chromosomes.
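As a non-limiting illustration, a count determined as a direct sum of mapped reads, and a ratio of a quantification for region a to a quantification for region b, can be sketched as follows. The fixed-size portions, the function names and the `(chromosome, portion index)` key are assumptions made for this sketch only; portions need not be of equal size.

```python
from collections import Counter


def count_reads_per_portion(read_positions, portion_size):
    """Count reads per fixed-size portion, keyed by (chromosome, portion index)."""
    counts = Counter()
    for chrom, pos in read_positions:
        counts[(chrom, pos // portion_size)] += 1
    return counts


def region_ratio(counts, region_a, region_b):
    """Ratio of the summed counts in region a to the summed counts in region b.

    Each region is an iterable of portion keys; portions with no reads count as 0.
    """
    total_a = sum(counts[p] for p in region_a)
    total_b = sum(counts[p] for p in region_b)
    if total_b == 0:
        raise ValueError("region b has no mapped reads")
    return total_a / total_b
```

For example, region a may be a single portion and region b may be the set of all portions, in which case the ratio is the fraction of all mapped reads falling in that portion.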

In some embodiments, a count is derived from raw sequence reads and/or filtered sequence reads. In certain embodiments a count is an average, mean or sum of sequence reads mapped to a genomic portion or group of genomic portions (e.g., genomic portions in a region). In some embodiments, a count is associated with an uncertainty value. A count sometimes is adjusted. A count may be adjusted according to sequence reads associated with a genomic portion or group of portions that have been weighted, removed, filtered, normalized, adjusted, averaged, derived as a mean, derived as a median, added, or combination thereof.

A sequence read quantification sometimes is a read density. A read density may be determined and/or generated for one or more segments of a genome. In certain instances, a read density may be determined and/or generated for one or more chromosomes. In some embodiments a read density comprises a quantitative measure of counts of sequence reads mapped to a segment or portion of a reference genome. A read density can be determined by a suitable process. In some embodiments a read density is determined by a suitable distribution and/or a suitable distribution function. Non-limiting examples of a distribution function include a probability function, probability distribution function, probability density function (PDF), a kernel density function (kernel density estimation), a cumulative distribution function, probability mass function, discrete probability distribution, an absolutely continuous univariate distribution, the like, any suitable distribution, or combinations thereof. A read density may be a density estimation derived from a suitable probability density function. A density estimation is the construction of an estimate, based on observed data, of an underlying probability density function. In some embodiments a read density comprises a density estimation (e.g., a probability density estimation, a kernel density estimation). A read density may be generated according to a process comprising generating a density estimation for each of the one or more portions of a genome where each portion comprises counts of sequence reads. A read density may be generated for normalized and/or weighted counts mapped to a portion or segment. In some instances, each read mapped to a portion or segment may contribute to a read density, a value (e.g., a count) equal to its weight obtained from a normalization process described herein. In some embodiments read densities for one or more portions or segments are adjusted. Read densities can be adjusted by a suitable method. For example, read densities for one or more portions can be weighted and/or normalized.
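As a non-limiting illustration of a kernel density estimation in which each read contributes a value equal to its weight, the following sketch uses a Gaussian kernel. The Gaussian kernel, the fixed bandwidth and the function name `gaussian_kde` are assumptions made for this sketch; any suitable kernel or distribution function could be substituted.

```python
import math


def gaussian_kde(positions, bandwidth, weights=None):
    """Return a function estimating read density at a genomic coordinate x.

    positions: mapped read coordinates within a portion or segment.
    weights:   optional per-read weights (e.g., from a normalization step);
               every read has weight 1.0 when omitted.
    """
    if weights is None:
        weights = [1.0] * len(positions)
    # Normalizing by the total weight makes the estimate integrate to 1.
    norm = sum(weights) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(
            w * math.exp(-0.5 * ((x - p) / bandwidth) ** 2)
            for p, w in zip(positions, weights)
        ) / norm

    return density
```

A larger bandwidth yields a smoother density, and a smaller bandwidth follows individual read positions more closely.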

Reads quantified for a given portion or segment can be from one source or different sources. In one example, reads may be obtained from nucleic acid from a subject having cancer or suspected of having cancer. In such circumstances, reads mapped to one or more portions often are reads representative of both healthy cells (i.e., non-cancer cells) and cancer cells (e.g., tumor cells). In certain embodiments, some of the reads mapped to a portion are from cancer cell nucleic acid and some of the reads mapped to the same portion are from non-cancer cell nucleic acid. In another example, reads may be obtained from a nucleic acid sample from a pregnant female bearing a fetus. In such circumstances, reads mapped to one or more portions often are reads representative of both the fetus and the mother of the fetus (e.g., a pregnant female subject). In certain embodiments some of the reads mapped to a portion are from a fetal genome and some of the reads mapped to the same portion are from a maternal genome.

Assays

Techniques of the present disclosure can be used to perform a variety of assays. In some cases, a sample can be assayed for some, many, or all of the overhangs present in the sample nucleic acids. This information can be used to generate an overall overhang profile for the sample, indicating the number or frequency of the overhangs present. In some cases, a sample can be assayed for a panel of one or more particular overhangs present in the sample. In some cases, a sample can be assayed for one or more features of the overhangs present in the sample. In some cases, a sample can be assayed for blunt-ended fragments (e.g., target nucleic acid (e.g., DNA) that is blunt-ended on one side or blunt-ended on both sides).
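As a non-limiting illustration, an overall overhang profile indicating the number and frequency of the overhangs present can be sketched as a tally over observed fragment ends. The `(type, sequence)` input format, the type labels and the function name are assumptions made for this sketch; a blunt end is represented here by an empty overhang sequence.

```python
from collections import Counter


def overhang_profile(overhangs):
    """Map each (overhang type, overhang length) class to (count, frequency).

    overhangs: iterable of (type, sequence) tuples, one per fragment end,
               e.g. ("5prime", "CG") or ("blunt", "").
    """
    counts = Counter((kind, len(seq)) for kind, seq in overhangs)
    total = sum(counts.values())
    return {key: (n, n / total) for key, n in counts.items()}
```

A panel of particular overhangs could be assayed in the same manner by keying on the overhang sequence rather than on its length.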

An overhang profile for a sample may be generated by analyzing and/or quantifying certain features of the overhangs present in the sample. In certain instances, profiles may additionally or alternatively include features of the target/template nucleic acids themselves (e.g., with or without overhang information). In certain instances, overhang profiles exclude features of the target/template nucleic acids. Thus, in certain embodiments, overhang profiles consist of overhang features. Overhang/template features may be analyzed or quantified using any suitable quantification method, clustering method, statistical algorithm, classifier or model including, but not limited to, regression (e.g., logistic regression, linear regression, multivariate regression, least squares regression), hierarchical clustering (e.g., Ward's hierarchical clustering), supervised learning algorithm (e.g., support vector machine (SVM)), multivariate model (e.g., principal component analysis (PCA)), linear discriminant analysis, quadratic discriminant analysis, bagging, neural networks, support vector machine models, random forests, classification tree models, K-nearest neighbors, and the like, and/or any suitable mathematical and/or statistical manipulation.

Overhang/template features that may be analyzed or quantified include, but are not limited to, dinucleotide count (e.g., presence/absence of a particular dinucleotide in the overhang or read (e.g., number of overhangs in the sample having a particular dinucleotide, number of template+overhangs in the sample having a particular dinucleotide, or number of template minus overhangs in the sample having a particular dinucleotide) and/or a count of the instances of a particular dinucleotide within an overhang or read); trinucleotide count (e.g., presence/absence of a particular trinucleotide in the overhang or read (e.g., number of overhangs in the sample having a particular trinucleotide, number of template+overhangs in the sample having a particular trinucleotide, or number of template minus overhangs in the sample having a particular trinucleotide) and/or a count of the instances of a particular trinucleotide within an overhang or read); tetranucleotide count (e.g., presence/absence of a particular tetranucleotide in the overhang or read (e.g., number of overhangs in the sample having a particular tetranucleotide, number of template+overhangs in the sample having a particular tetranucleotide, or number of template minus overhangs in the sample having a particular tetranucleotide) and/or a count of the instances of a particular tetranucleotide within an overhang or read); dinucleotide percent (e.g., percent of overhangs in the sample having a particular dinucleotide, percent of template+overhangs in the sample having a particular dinucleotide, or percent of template minus overhangs in the sample having a particular dinucleotide; number of dinucleotides in the overhang normalized by the overhang length; the proportion of the dinucleotide that is of that particular overhang; comparison across all overhangs regardless of length); trinucleotide percent (e.g., percent of overhangs in the sample having a particular trinucleotide, percent of template+overhangs in the sample having a particular trinucleotide, or percent of template minus overhangs in the sample having a particular trinucleotide; number of trinucleotides in the overhang normalized by the overhang length; the proportion of the trinucleotide that is of that particular overhang; comparison across all overhangs regardless of length); tetranucleotide percent (e.g., percent of overhangs in the sample having a particular tetranucleotide, percent of template+overhangs in the sample having a particular tetranucleotide, or percent of template minus overhangs in the sample having a particular tetranucleotide; number of tetranucleotides in the overhang normalized by the overhang length; the proportion of the tetranucleotide that is of that particular overhang; comparison across all overhangs regardless of length); full length of template; length category (e.g., for cfDNA: subnucleosome, mononucleosome, multinucleosome); overhang length (e.g., 1 base, 2 bases, 3 bases, 4 bases, 5 bases, 6 bases, 7 bases, 8 bases, 9 bases, 10 bases, or more); overhang type (e.g., 5′ overhang, 3′ overhang, blunt); GC content (e.g., overhang GC content, template+overhang GC content, or template minus overhang GC content); overhang percent (e.g., log2 percent of overhang sequence/total overhangs); overhang count (e.g., counts of particular overhang sequence); percent length (e.g., length of overhang/full length of template); dinucleotide count in overhang vs. entire sequence of template molecule; trinucleotide count in overhang vs. entire sequence of template molecule; tetranucleotide count in overhang vs. entire sequence of template molecule; Boolean variables which may include whether an overhang overlaps with, is contained in, and/or starts or ends in a particular region (e.g., coding regions, CpG islands, transcription factor binding sites (e.g., CCCTC-binding factor (CTCF) binding site), DNase hypersensitive sites, sequences denoting open chromatin (e.g., ATAC-seq peaks), promoter regions, enhancer regions, hypermethylated regions, other regions of interest, and the like); genome coordinates; mean fragment length or distribution of molecules with a given overhang type and length; mean fragment length or distribution of molecules with a given overhang sequence; delta between libraries (e.g., identification of correlations in the data between variables (e.g., detect correlation between X feature and Y feature such as the mean of the fragment length distribution vs. X variable (e.g., mean length or distribution of fragments with a given overhang sequence vs. its X, where X=any feature/variable above))); and the like and combinations thereof. Example dinucleotides include AA, AT, AC, AG, TT, TA, TC, TG, CC, CG, CA, CT, GG, GA, GC, and GT. Trinucleotides include 4³ possible nucleotide combinations, and tetranucleotides include 4⁴ possible nucleotide combinations. In some embodiments, presence of a dinucleotide in the overhangs in a sample is analyzed. In some embodiments, presence of a CG dinucleotide in the overhangs in a sample is analyzed. In some embodiments, presence of a GG dinucleotide in the overhangs in a sample is analyzed. In some embodiments, presence of a GC dinucleotide in the overhangs in a sample is analyzed.
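As a non-limiting illustration, several of the features listed above (overhang length, percent length, GC content, and a dinucleotide count in the overhang versus the entire template sequence) can be computed for a single fragment end as sketched below. The function names and the returned keys are hypothetical. The sketch assumes that the `template` string is the full template+overhang sequence and that overlapping occurrences of a dinucleotide are each counted.

```python
def kmer_count(seq, kmer):
    """Count occurrences of kmer in seq, including overlapping occurrences."""
    k = len(kmer)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == kmer)


def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def overhang_features(overhang, template, kmer="CG"):
    """Compute a few per-fragment overhang/template features.

    template is assumed to include the overhang (template+overhang).
    """
    if not template:
        raise ValueError("template sequence must be non-empty")
    return {
        "overhang_length": len(overhang),
        "percent_length": 100.0 * len(overhang) / len(template),
        "overhang_gc": gc_content(overhang),
        "template_gc": gc_content(template),
        "kmer_in_overhang": kmer_count(overhang, kmer),
        "kmer_in_template": kmer_count(template, kmer),
    }
```

The same `kmer_count` helper applies to trinucleotides and tetranucleotides by passing a three-base or four-base `kmer`.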

Overhang profiles, including overall overhang profiles, overhang panels, and overhang features, can be indicative of various characteristics of a sample or a source (e.g., organism) from which a sample was taken. These characteristics can include, but are not limited to, nuclease activity and/or content, topoisomerase activity and/or content, disease (e.g., cancer type, cancer stage, infection, organ disease or failure, neurodegenerative disease, ischemia, stroke, cardiovascular disease), cell death (e.g., increased or decreased rate of cell death systemically, increased or decreased rate of cell death in a particular organ or cell type, increased or decreased rate of certain modes of cell death (e.g., apoptosis, autophagy, necrosis, mitotic catastrophe, anoikis, cornification, excitotoxicity, ferroptosis, Wallerian degeneration, activation-induced cell death (AICD), ischemic cell death, oncosis, immunogenic cell death or apoptosis, pyroptosis), dysregulation of apoptosis or other cell death modes), microbiome profile (e.g., gut microbiome, blood microbiome, mouth microbiome, skin microbiome, environmental microbiome (such as soil microbiome, water microbiome)), and radiation exposure type and/or amount (e.g., ultraviolet (A and B), ionizing radiation (e.g., cosmic rays, alpha particles, beta particles, gamma rays, X-rays), neutron radiation). In some embodiments, overhang profiles, including overall overhang profiles, overhang panels, and overhang features, are indicative of cancer. In some embodiments, overhang profiles, including overall overhang profiles, overhang panels, and overhang features, are indicative of gastrointestinal cancer.

Overhang profiles, including overall overhang profiles, overhang panels, and overhang features, can be indicative of nuclease (e.g., DNase) activity, such as endogenous nuclease activity. Nuclease (e.g., DNase) activity can be indicative of various characteristics of a sample or a source discussed herein, including but not limited to cancer. In some cases, the overhangs of naturally present nucleic acids in a sample can be assayed. In some cases, nucleic acids (e.g., synthesized nucleic acids) can be introduced into a sample, where they can then be acted on by nucleases present in the sample. Use of a known nucleic acid population can produce an overhang profile that is compared to those from different samples. The different overhangs produced on the known nucleic acids can be informative of the nuclease profile of the sample. Tissue-specific nuclease activity can be assayed in vitro. For example, cell lines from different organs, tissues, or cell types can be cultured and cell death can be induced, followed by an assay of overhang profiles. Overhang profiles also can be assayed for a particular enzyme (e.g., nuclease) or group of enzymes. A particular enzyme or group of enzymes can be used to digest a population of nucleic acids, and the resulting overhang profile can be assayed. For example, CRISPR/Cas-system proteins or other nucleic acid-guided nucleases can be assayed to determine the type of ends (e.g., blunt ends, 1-bp staggered ends, other overhangs) they produce. In some applications, overhang profile assays may be used to monitor the efficacy of particular treatments and targeted therapies that aim to alter DNase activity (e.g., vitamins C and K3; topoisomerase inhibitors used in anti-cancer therapies; and the like).

In some cases, nucleases in a subject or a sample can be inhibited to preserve a particular overhang profile. For example, cellular processes may produce one overhang profile (e.g., from lysis, cell death, and/or post-mortem intracellular processes), while nucleases present outside the cell (e.g., in a bodily fluid such as blood) may further alter the first overhang profile of the cell. Nucleases, such as those outside the cell, can be inhibited or deactivated (e.g., temporarily) to preserve the initial overhang profile for assaying. Nuclease activity can be inhibited (e.g., with actin) ahead of the sample collection. In an example, two populations of overhangs are assayed, those from diseased cells (D) and those from healthy cells (H); after release of DNA from the cells, nucleases in the blood may further alter the overhangs, resulting in modified overhang populations D′ and H′; inhibiting the nucleases (e.g., DNases) present in the blood can allow assaying of overhang populations that are not modified or are less modified (e.g., D and H, or closer to D and H than would be observed without inhibition). Other enzymes affecting overhang profiles can also be inhibited. For example, topoisomerase excisions can cleave nucleic acids resulting in particular overhang profiles. Topoisomerase inhibitors can be introduced to preserve these overhangs (e.g., by preventing re-ligation) to allow assaying of these profiles.

Overhang profiles can be assayed by a variety of techniques. Overhangs can be assayed by nucleic acid sequencing, including as disclosed herein. Overhangs can be assayed by binding or hybridization. For example, overhangs can be bound to binding agents that specifically hybridize particular overhangs. Binding agents can be located on a substrate, such as an array or a bead. Binding events can be detected (e.g., fluorescence or other optical signal, electrical signal) and the overhang profile can be determined. Prior to an assay, or as part of an assay, particular species of nucleic acids (e.g., those with a particular overhang or with one or more overhangs from a panel of overhangs) can be enriched, including as disclosed herein.

Classifications and Uses Thereof

Methods described herein can provide an outcome indicative of one or more characteristics of a sample or source described above. Methods described herein sometimes provide an outcome indicative of a phenotype and/or presence or absence of a medical condition for a test sample (e.g., providing an outcome determinative of the presence or absence of a medical condition and/or phenotype). An outcome often is part of a classification process, and a classification (e.g., classification of one or more characteristics of a sample or source; and/or presence or absence of a genotype, phenotype, genetic variation and/or medical condition for a test sample) sometimes is based on and/or includes an outcome. An outcome and/or classification sometimes is based on and/or includes a result of data processing for a test sample that facilitates determining one or more characteristics of a sample or source and/or presence or absence of a genotype, phenotype, genetic variation, genetic alteration, and/or medical condition in a classification process (e.g., a statistic value). An outcome and/or classification sometimes includes or is based on a score determinative of, or a call of, one or more characteristics of a sample or source and/or presence or absence of a genotype, phenotype, genetic variation, genetic alteration, and/or medical condition. In certain embodiments, an outcome and/or classification includes a conclusion that predicts and/or determines one or more characteristics of a sample or source and/or presence or absence of a genotype, phenotype, genetic variation, genetic alteration, and/or medical condition in a classification process.

Any suitable expression of an outcome and/or classification can be provided. An outcome and/or classification sometimes is based on and/or includes one or more numerical values generated using a processing method described herein in the context of one or more considerations of probability. Non-limiting examples of values that can be utilized include a sensitivity, specificity, standard deviation, median absolute deviation (MAD), measure of certainty, measure of confidence, measure of certainty or confidence that a value obtained for a test sample is inside or outside a particular range of values, measure of uncertainty, measure of uncertainty that a value obtained for a test sample is inside or outside a particular range of values, coefficient of variation (CV), confidence level, confidence interval (e.g., about 95% confidence interval), standard score (e.g., z-score), chi value, phi value, result of a t-test, p-value, ploidy value, fitted minority species fraction, area ratio, median level, the like or combination thereof. In some embodiments, an outcome and/or classification comprises an overhang profile, a read density, a read density profile and/or a plot (e.g., a profile plot). In certain embodiments, multiple values are analyzed together, sometimes in a profile for such values (e.g., z-score profile, p-value profile, chi value profile, phi value profile, result of a t-test, value profile, the like, or combination thereof). A consideration of probability can facilitate determining one or more characteristics of a sample or source and/or whether a subject is at risk of having, or has, a genotype, phenotype, genetic variation and/or medical condition, and an outcome and/or classification determinative of the foregoing sometimes includes such a consideration.

In certain embodiments, an outcome and/or classification is based on and/or includes a conclusion that predicts and/or determines a risk or probability of the presence or absence of a genotype, phenotype, genetic variation and/or medical condition for a test sample. A conclusion sometimes is based on a value determined from a data analysis method described herein (e.g., a statistics value indicative of probability, certainty and/or uncertainty (e.g., standard deviation, median absolute deviation (MAD), measure of certainty, measure of confidence, measure of certainty or confidence that a value obtained for a test sample is inside or outside a particular range of values, measure of uncertainty, measure of uncertainty that a value obtained for a test sample is inside or outside a particular range of values, coefficient of variation (CV), confidence level, confidence interval (e.g., about 95% confidence interval), standard score (e.g., z-score), chi value, phi value, result of a t-test, p-value, sensitivity, specificity, the like or combination thereof)). An outcome and/or classification sometimes is expressed in a laboratory test report for a particular test sample as a probability (e.g., odds ratio, p-value), likelihood, or risk factor, associated with the presence or absence of a genotype, phenotype, genetic variation and/or medical condition. An outcome and/or classification for a test sample sometimes is provided as “positive” or “negative” with respect to a particular genotype, phenotype, genetic variation and/or medical condition. For example, an outcome and/or classification sometimes is designated as “positive” in a laboratory test report for a particular test sample where presence of a genotype, phenotype, genetic variation and/or medical condition is determined, and sometimes an outcome and/or classification is designated as “negative” in a laboratory test report for a particular test sample where absence of a genotype, phenotype, genetic variation and/or medical condition is determined. An outcome and/or classification sometimes is determined and sometimes includes an assumption used in data processing.

There typically are four types of classifications generated in a classification process: true positive, false positive, true negative and false negative. The term “true positive” as used herein refers to presence of a genotype, phenotype, genetic variation, or medical condition correctly determined for a test sample. The term “false positive” as used herein refers to presence of a genotype, phenotype, genetic variation, or medical condition incorrectly determined for a test sample. The term “true negative” as used herein refers to absence of a genotype, phenotype, genetic variation, or medical condition correctly determined for a test sample. The term “false negative” as used herein refers to absence of a genotype, phenotype, genetic variation, or medical condition incorrectly determined for a test sample. Two measures of performance for a classification process can be calculated based on the ratios of these occurrences: (i) a sensitivity value, which generally is the fraction of actual positives that are correctly identified as being positive; and (ii) a specificity value, which generally is the fraction of actual negatives that are correctly identified as being negative.
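As a non-limiting illustration, the four classification types can be tallied from known sample statuses and the corresponding calls, and the two performance measures computed using the conventional definitions: sensitivity = TP/(TP+FN) and specificity = TN/(TN+FP). The function name and the use of Boolean lists are assumptions made for this sketch.

```python
def sensitivity_specificity(truth, calls):
    """Return (sensitivity, specificity) for paired true statuses and calls.

    truth: iterable of bool, True where the condition is actually present.
    calls: iterable of bool, True where the classification is "positive".
    A measure is NaN when its denominator is zero.
    """
    tp = fp = tn = fn = 0
    for actual, called in zip(truth, calls):
        if actual and called:
            tp += 1  # true positive
        elif actual and not called:
            fn += 1  # false negative
        elif not actual and called:
            fp += 1  # false positive
        else:
            tn += 1  # true negative
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```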

In certain embodiments, a laboratory test report generated for a classification process includes a measure of test performance (e.g., sensitivity and/or specificity) and/or a measure of confidence (e.g., a confidence level, confidence interval). A measure of test performance and/or confidence sometimes is obtained from a clinical validation study performed prior to performing a laboratory test for a test sample. In certain embodiments, one or more of sensitivity, specificity and/or confidence are expressed as a percentage. In some embodiments, a percentage expressed independently for each of sensitivity, specificity or confidence level, is greater than about 90% (e.g., about 90, 91, 92, 93, 94, 95, 96, 97, 98 or 99%, or greater than 99% (e.g., about 99.5% or greater, about 99.9% or greater, about 99.95% or greater, about 99.99% or greater)). A confidence interval expressed for a particular confidence level (e.g., a confidence level of about 90% to about 99.9% (e.g., about 95%)) can be expressed as a range of values, and sometimes is expressed as a range of sensitivities and/or specificities for a particular confidence level. Coefficient of variation (CV) in some embodiments is expressed as a percentage, and sometimes the percentage is about 10% or less (e.g., about 10, 9, 8, 7, 6, 5, 4, 3, 2 or 1%, or less than 1% (e.g., about 0.5% or less, about 0.1% or less, about 0.05% or less, about 0.01% or less)). A probability (e.g., that a particular outcome and/or classification is not due to chance) in certain embodiments is expressed as a standard score (e.g., z-score), a p-value, or result of a t-test. In some embodiments, a measured variance, confidence level, confidence interval, sensitivity, specificity and the like (e.g., referred to collectively as confidence parameters) for an outcome and/or classification can be generated using one or more data processing manipulations described herein.
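As a non-limiting illustration, a coefficient of variation expressed as a percentage and a standard score (z-score) for a test value can be computed as sketched below. The sketch assumes the sample standard deviation is used and that the z-score is taken relative to a set of reference values; the function names are hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


def z_score(test_value, reference_values):
    """Number of reference standard deviations separating a test value from the reference mean."""
    mean = statistics.mean(reference_values)
    return (test_value - mean) / statistics.stdev(reference_values)
```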

An outcome and/or classification for a test sample often is ordered by, and often is provided to, a health care professional or other qualified individual (e.g., physician or assistant) who transmits an outcome and/or classification to a subject from whom the test sample is obtained. In certain embodiments, an outcome and/or classification is provided using a suitable visual medium (e.g., a peripheral or component of a machine, e.g., a printer or display). A classification and/or outcome often is provided to a healthcare professional or qualified individual in the form of a report. A report typically comprises a display of an outcome and/or classification (e.g., a value, one or more characteristics of a sample or source, or an assessment or probability of presence or absence of a genotype, phenotype, genetic variation and/or medical condition), sometimes includes an associated confidence parameter, and sometimes includes a measure of performance for a test used to generate the outcome and/or classification. A report sometimes includes a recommendation for a follow-up procedure (e.g., a procedure that confirms the outcome or classification). A report sometimes includes a visual representation of a chromosome or portion thereof (e.g., a chromosome ideogram or karyogram), and sometimes shows a visualization of a duplication and/or deletion region for a chromosome (e.g., a visualization of a whole chromosome for a chromosome deletion or duplication; a visualization of a whole chromosome with a deleted region or duplicated region shown; a visualization of a portion of chromosome duplicated or deleted; a visualization of a portion of a chromosome remaining in the event of a deletion of a portion of a chromosome) identified for a test sample.

A report can be displayed in a suitable format that facilitates determination of presence or absence of a genotype, phenotype, genetic variation and/or medical condition by a health professional or other qualified individual. Non-limiting examples of formats suitable for use for generating a report include digital data, a graph, a 2D graph, a 3D graph, a 4D graph, a picture (e.g., a jpg, bitmap (e.g., bmp), pdf, tiff, gif, raw, png, the like or suitable format), a pictograph, a chart, a table, a bar graph, a pie graph, a diagram, a flow chart, a scatter plot, a map, a histogram, a density chart, a function graph, a circuit diagram, a block diagram, a bubble map, a constellation diagram, a contour diagram, a cartogram, spider chart, Venn diagram, nomogram, and the like, or combination of the foregoing.

A report may be generated by a computer and/or by human data entry, and can be transmitted and communicated using a suitable electronic medium (e.g., via the internet, via computer, via facsimile, from one network location to another location at the same or different physical sites), or by another method of sending or receiving data (e.g., mail service, courier service and the like). Non-limiting examples of communication media for transmitting a report include auditory file, computer readable file (e.g., pdf file), paper file, laboratory file, medical record file, or any other medium described in the previous paragraph. A laboratory file or medical record file may be in tangible form or electronic form (e.g., computer readable form), in certain embodiments. After a report is generated and transmitted, a report can be received by obtaining, via a suitable communication medium, a written and/or graphical representation comprising an outcome and/or classification, which upon review allows a healthcare professional or other qualified individual to make a determination as to one or more characteristics of a sample or source, or presence or absence of a genotype, phenotype, genetic variation and/or medical condition for a test sample.

An outcome and/or classification may be provided by and obtained from a laboratory (e.g., obtained from a laboratory file). A laboratory file can be generated by a laboratory that carries out one or more tests for determining one or more characteristics of a sample or source and/or presence or absence of a genotype, phenotype, genetic variation and/or medical condition for a test sample. Laboratory personnel (e.g., a laboratory manager) can analyze information associated with test samples (e.g., test profiles, reference profiles, test values, reference values, level of deviation, patient information) underlying an outcome and/or classification. For calls pertaining to presence or absence of a genotype, phenotype, genetic variation and/or medical condition that are close or questionable, laboratory personnel can re-run the same procedure using the same (e.g., aliquot of the same sample) or different test sample from a test subject. A laboratory may be in the same location or different location (e.g., in another country) as personnel assessing the presence or absence of a genotype, phenotype, genetic variation and/or a medical condition from the laboratory file. For example, a laboratory file can be generated in one location and transmitted to another location in which the information for a test sample therein is assessed by a healthcare professional or other qualified individual, and optionally, transmitted to the subject from which the test sample was obtained. A laboratory sometimes generates and/or transmits a laboratory report containing a classification of presence or absence of genomic instability, a genotype, phenotype, a genetic variation and/or a medical condition for a test sample. A laboratory generating a laboratory test report sometimes is a certified laboratory, and sometimes is a laboratory certified under the Clinical Laboratory Improvement Amendments (CLIA).

An outcome and/or classification sometimes is a component of a diagnosisfor a subject, and sometimes an outcome and/or classification isutilized and/or assessed as part of providing a diagnosis for a testsample. For example, a healthcare professional or other qualifiedindividual may analyze an outcome and/or classification and provide adiagnosis based on, or based in part on, the outcome and/orclassification. In some embodiments, determination, detection ordiagnosis of a medical condition, disease, syndrome or abnormalitycomprises use of an outcome and/or classification determinative ofpresence or absence of a genotype, phenotype, genetic variation and/ormedical condition. Thus, provided herein are methods for diagnosingpresence or absence of a genotype, phenotype, a genetic variation and/ora medical condition for a test sample according to an outcome orclassification generated by methods described herein, and optionallyaccording to generating and transmitting a laboratory report thatincludes a classification for presence or absence of the genotype,phenotype, a genetic variation and/or a medical condition for the testsample.

Machines, Software and Interfaces

Certain processes and methods described herein (e.g., selecting a subsetof reads, generating an overhang profile, processing overhang data,processing overhang quantifications, determining one or morecharacteristics of a sample based on overhang data or an overhangprofile) often cannot be performed without a computer, microprocessor,software, module or other machine. Methods described herein may becomputer-implemented methods, and one or more portions of a methodsometimes are performed by one or more processors (e.g.,microprocessors), computers, systems, apparatuses, or machines (e.g.,microprocessor-controlled machine).

Computers, systems, apparatuses, machines and computer program productssuitable for use often include, or are utilized in conjunction with,computer readable storage media. Non-limiting examples of computerreadable storage media include memory, hard disk, CD-ROM, flash memorydevice and the like. Computer readable storage media generally arecomputer hardware, and often are non-transitory computer-readablestorage media. Computer readable storage media are not computer readabletransmission media, the latter of which are transmission signals per se.

Provided herein are computer readable storage media with an executableprogram stored thereon, where the program instructs a microprocessor toperform a method described herein. Provided also are computer readablestorage media with an executable program module stored thereon, wherethe program module instructs a microprocessor to perform part of amethod described herein. Also provided herein are systems, machines,apparatuses and computer program products that include computer readablestorage media with an executable program stored thereon, where theprogram instructs a microprocessor to perform a method described herein.Provided also are systems, machines and apparatuses that includecomputer readable storage media with an executable program module storedthereon, where the program module instructs a microprocessor to performpart of a method described herein.

Also provided are computer program products. A computer program productoften includes a computer usable medium that includes a computerreadable program code embodied therein, the computer readable programcode adapted for being executed to implement a method or part of amethod described herein. Computer usable media and readable program codeare not transmission media (i.e., transmission signals per se). Computerreadable program code often is adapted for being executed by aprocessor, computer, system, apparatus, or machine.

In some embodiments, methods described herein (e.g., selecting a subset of reads, generating an overhang profile, processing overhang data, processing overhang quantifications, determining one or more characteristics of a sample based on overhang data or an overhang profile) are performed by automated methods. In some embodiments, one or more steps of a method described herein are carried out by a microprocessor and/or computer, and/or carried out in conjunction with memory. In some embodiments, an automated method is embodied in software, modules, microprocessors, peripherals and/or a machine comprising the like, that perform methods described herein. As used herein, software refers to computer readable program instructions that, when executed by a microprocessor, perform computer operations, as described herein.

Machines, software and interfaces may be used to conduct methodsdescribed herein. Using machines, software and interfaces, a user mayenter, request, query or determine options for using particularinformation, programs or processes (e.g., processing overhang data,processing overhang quantifications, and/or providing an outcome), whichcan involve implementing statistical analysis algorithms, statisticalsignificance algorithms, statistical algorithms, iterative steps,validation algorithms, and graphical representations, for example. Insome embodiments, a data set may be entered by a user as inputinformation, a user may download one or more data sets by suitablehardware media (e.g., flash drive), and/or a user may send a data setfrom one system to another for subsequent processing and/or providing anoutcome (e.g., send sequence read data from a sequencer to a computersystem for overhang sequence processing; send processed overhang data toa computer system for further processing and/or yielding an outcomeand/or report).

A system typically comprises one or more machines. Each machine comprises one or more of memory, one or more microprocessors, and instructions. Where a system includes two or more machines, some or all of the machines may be located at the same location, some or all of the machines may be located at different locations, all of the machines may be located at one location and/or all of the machines may be located at different locations. Where a system includes two or more machines, some or all of the machines may be located at the same location as a user, some or all of the machines may be located at a location different than a user, all of the machines may be located at the same location as the user, and/or all of the machines may be located at one or more locations different than the user.

A system sometimes comprises a computing machine and a sequencing apparatus or machine, where the sequencing apparatus or machine is configured to receive physical nucleic acid and generate sequence reads, and the computing machine is configured to process the reads from the sequencing apparatus or machine. The computing machine sometimes is configured to determine an outcome from the sequence reads (e.g., a characteristic of a sample).

A user may, for example, place a query to software which then mayacquire a data set via internet access, and in certain embodiments, aprogrammable microprocessor may be prompted to acquire a suitable dataset based on given parameters. A programmable microprocessor also mayprompt a user to select one or more data set options selected by themicroprocessor based on given parameters. A programmable microprocessormay prompt a user to select one or more data set options selected by themicroprocessor based on information found via the internet, otherinternal or external information, or the like. Options may be chosen forselecting one or more data feature selections, one or more statisticalalgorithms, one or more statistical analysis algorithms, one or morestatistical significance algorithms, iterative steps, one or morevalidation algorithms, and one or more graphical representations ofmethods, machines, apparatuses, computer programs or a non-transitorycomputer-readable storage medium with an executable program storedthereon.

Systems addressed herein may comprise general components of computersystems, such as, for example, network servers, laptop systems, desktopsystems, handheld systems, personal digital assistants, computingkiosks, and the like. A computer system may comprise one or more inputmeans such as a keyboard, touch screen, mouse, voice recognition orother means to allow the user to enter data into the system. A systemmay further comprise one or more outputs, including, but not limited to,a display screen (e.g., CRT or LCD), speaker, FAX machine, printer(e.g., laser, ink jet, impact, black and white or color printer), orother output useful for providing visual, auditory and/or hardcopyoutput of information (e.g., outcome and/or report).

In a system, input and output components may be connected to a centralprocessing unit which may comprise among other components, amicroprocessor for executing program instructions and memory for storingprogram code and data. In some embodiments, processes may be implementedas a single user system located in a single geographical site. Incertain embodiments, processes may be implemented as a multi-usersystem. In the case of a multi-user implementation, multiple centralprocessing units may be connected by means of a network. The network maybe local, encompassing a single department in one portion of a building,an entire building, span multiple buildings, span a region, span anentire country or be worldwide. The network may be private, being ownedand controlled by a provider, or it may be implemented as an internetbased service where the user accesses a web page to enter and retrieveinformation. Accordingly, in certain embodiments, a system includes oneor more machines, which may be local or remote with respect to a user.More than one machine in one location or multiple locations may beaccessed by a user, and data may be mapped and/or processed in seriesand/or in parallel. Thus, a suitable configuration and control may beutilized for mapping and/or processing data using multiple machines,such as in local network, remote network and/or “cloud” computingplatforms.

A system can include a communications interface in some embodiments. Acommunications interface allows for transfer of software and databetween a computer system and one or more external devices. Non-limitingexamples of communications interfaces include a modem, a networkinterface (such as an Ethernet card), a communications port, a PCMCIAslot and card, and the like. Software and data transferred via acommunications interface generally are in the form of signals, which canbe electronic, electromagnetic, optical and/or other signals capable ofbeing received by a communications interface. Signals often are providedto a communications interface via a channel. A channel often carriessignals and can be implemented using wire or cable, fiber optics, aphone line, a cellular phone link, an RF link and/or othercommunications channels. Thus, in an example, a communications interfacemay be used to receive signal information that can be detected by asignal detection module.

Data may be input by a suitable device and/or method, including, but notlimited to, manual input devices or direct data entry devices (DDEs).Non-limiting examples of manual devices include keyboards, conceptkeyboards, touch sensitive screens, light pens, mouse, tracker balls,joysticks, graphic tablets, scanners, digital cameras, video digitizersand voice recognition devices. Non-limiting examples of DDEs include barcode readers, magnetic strip codes, smart cards, magnetic ink characterrecognition, optical character recognition, optical mark recognition,and turnaround documents.

In some embodiments, output from a sequencing apparatus or machine mayserve as data that can be input via an input device. In certainembodiments, overhang information (e.g., overhang features such aslength, type, sequence) may serve as data that can be input via an inputdevice. In certain embodiments, mapped sequence reads may serve as datathat can be input via an input device. In certain embodiments, nucleicacid fragment size (e.g., length) may serve as data that can be inputvia an input device. In certain embodiments, output from a nucleic acidcapture process (e.g., genomic region origin data) may serve as datathat can be input via an input device. In certain embodiments, acombination of nucleic acid fragment size (e.g., length) and output froma nucleic acid capture process (e.g., genomic region origin data) mayserve as data that can be input via an input device. In certainembodiments, simulated data is generated by an in silico process and thesimulated data serves as data that can be input via an input device. Theterm “in silico” refers to research and experiments performed using acomputer. In silico processes include, but are not limited to, mappingsequence reads and processing mapped sequence reads according toprocesses described herein.

A system may include software useful for performing a process or part ofa process described herein, and software can include one or more modulesfor performing such processes (e.g., sequencing module, logic processingmodule, data display organization module). The term “software” refers tocomputer readable program instructions that, when executed by acomputer, perform computer operations. Instructions executable by theone or more microprocessors sometimes are provided as executable code,that when executed, can cause one or more microprocessors to implement amethod described herein. A module described herein can exist assoftware, and instructions (e.g., processes, routines, subroutines)embodied in the software can be implemented or performed by amicroprocessor. For example, a module (e.g., a software module) can be apart of a program that performs a particular process or task. The term“module” refers to a self-contained functional unit that can be used ina larger machine or software system. A module can comprise a set ofinstructions for carrying out a function of the module. A module cantransform data and/or information. Data and/or information can be in asuitable form. For example, data and/or information can be digital oranalogue. In certain embodiments, data and/or information sometimes canbe packets, bytes, characters, or bits. In some embodiments, data and/orinformation can be any gathered, assembled or usable data orinformation. Non-limiting examples of data and/or information include asuitable media, pictures, video, sound (e.g. frequencies, audible ornon-audible), numbers, constants, a value, objects, time, functions,instructions, maps, references, sequences, reads, mapped reads, levels,ranges, thresholds, signals, displays, representations, ortransformations thereof. 
A module can accept or receive data and/orinformation, transform the data and/or information into a second form,and provide or transfer the second form to a machine, peripheral,component or another module. A microprocessor can, in certainembodiments, carry out the instructions in a module. In someembodiments, one or more microprocessors are required to carry outinstructions in a module or group of modules. A module can provide dataand/or information to another module, machine or source and can receivedata and/or information from another module, machine or source.

A computer program product sometimes is embodied on a tangiblecomputer-readable medium, and sometimes is tangibly embodied on anon-transitory computer-readable medium. A module sometimes is stored ona computer readable medium (e.g., disk, drive) or in memory (e.g.,random access memory). A module and microprocessor capable ofimplementing instructions from a module can be located in a machine orin a different machine. A module and/or microprocessor capable ofimplementing an instruction for a module can be located in the samelocation as a user (e.g., local network) or in a different location froma user (e.g., remote network, cloud system). In embodiments in which amethod is carried out in conjunction with two or more modules, themodules can be located in the same machine, one or more modules can belocated in different machine in the same physical location, and one ormore modules may be located in different machines in different physicallocations.

A machine, in some embodiments, comprises at least one microprocessorfor carrying out the instructions in a module. Sequence readquantifications (e.g., counts) and/or overhang data sometimes areaccessed by a microprocessor that executes instructions configured tocarry out a method described herein. Sequence read quantificationsand/or overhang data that are accessed by a microprocessor can be withinmemory of a system, and the counts and/or overhang data can be accessedand placed into the memory of the system after they are obtained. Insome embodiments, a machine includes a microprocessor (e.g., one or moremicroprocessors) which microprocessor can perform and/or implement oneor more instructions (e.g., processes, routines and/or subroutines) froma module. In some embodiments, a machine includes multiplemicroprocessors, such as microprocessors coordinated and working inparallel. In some embodiments, a machine operates with one or moreexternal microprocessors (e.g., an internal or external network, server,storage device and/or storage network (e.g., a cloud)). In someembodiments, a machine comprises a module (e.g., one or more modules). Amachine comprising a module often is capable of receiving andtransferring one or more of data and/or information to and from othermodules.

In certain embodiments, a machine comprises peripherals and/or components. In certain embodiments, a machine can comprise one or more peripherals or components that can transfer data and/or information to and from other modules, peripherals and/or components. In certain embodiments, a machine interacts with a peripheral and/or component that provides data and/or information. In certain embodiments, peripherals and components assist a machine in carrying out a function or interact directly with a module. Non-limiting examples of peripherals and/or components include a suitable computer peripheral, I/O or storage method or device including but not limited to scanners, printers, displays (e.g., monitors, LED, LCD or CRTs), cameras, microphones, pads (e.g., iPads, tablets), touch screens, smart phones, mobile phones, USB I/O devices, USB mass storage devices, keyboards, a computer mouse, digital pens, modems, hard drives, jump drives, flash drives, a microprocessor, a server, CDs, DVDs, graphic cards, specialized I/O devices (e.g., sequencers, photo cells, photo multiplier tubes, optical readers, sensors, etc.), one or more flow cells, fluid handling components, network interface controllers, ROM, RAM, wireless transfer methods and devices (Bluetooth, WiFi, and the like), the world wide web (www), the internet, a computer and/or another module.

Software often is provided on a program product containing program instructions recorded on a computer readable medium, including, but not limited to, magnetic media including floppy disks, hard disks, and magnetic tape; and optical media including CD-ROM discs, DVD discs, magneto-optical discs, flash memory devices (e.g., flash drives), RAM, floppy discs, the like, and other such media on which the program instructions can be recorded. In online implementation, a server and web site maintained by an organization can be configured to provide software downloads to remote users, or remote users may access a remote system maintained by an organization to remotely access software. Software may obtain or receive input information. Software may include a module that specifically obtains or receives data (e.g., a data receiving module that receives sequence read data and/or mapped read data) and may include a module that specifically processes the data (e.g., a processing module that processes received data (e.g., filters, normalizes, provides an outcome and/or report)). The terms “obtaining” and “receiving” input information refer to receiving data (e.g., sequence reads, mapped reads) by computer communication means from a local or remote site, human data entry, or any other method of receiving data. The input information may be generated in the same location at which it is received, or it may be generated in a different location and transmitted to the receiving location. In some embodiments, input information is modified before it is processed (e.g., placed into a format amenable to processing (e.g., tabulated)).

Software can include one or more algorithms in certain embodiments. An algorithm may be used for processing data and/or providing an outcome or report according to a finite sequence of instructions. An algorithm often is a list of defined instructions for completing a task. Starting from an initial state, the instructions may describe a computation that proceeds through a defined series of successive states, eventually terminating in a final ending state. The transition from one state to the next is not necessarily deterministic (e.g., some algorithms incorporate randomness). By way of example, and without limitation, an algorithm can be a search algorithm, sorting algorithm, merge algorithm, numerical algorithm, graph algorithm, string algorithm, modeling algorithm, computational genometric algorithm, combinatorial algorithm, machine learning algorithm, cryptography algorithm, data compression algorithm, parsing algorithm and the like. An algorithm can include one algorithm or two or more algorithms working in combination. An algorithm can be of any suitable complexity class and/or parameterized complexity. An algorithm can be used for calculation and/or data processing, and in some embodiments, can be used in a deterministic or probabilistic/predictive approach. An algorithm can be implemented in a computing environment by use of a suitable programming language, non-limiting examples of which are C, C++, Java, Perl, Python, Fortran, and the like. In some embodiments, an algorithm can be configured or modified to include margins of error, statistical analysis, statistical significance, and/or comparison to other information or data sets (e.g., applicable when using a neural net or clustering algorithm).

In certain embodiments, several algorithms may be implemented for use insoftware. These algorithms can be trained with raw data in someembodiments. For each new raw data sample, the trained algorithms mayproduce a representative processed data set or outcome. A processed dataset sometimes is of reduced complexity compared to the parent data setthat was processed. Based on a processed set, the performance of atrained algorithm may be assessed based on sensitivity and specificity,in some embodiments. An algorithm with the highest sensitivity and/orspecificity may be identified and utilized, in certain embodiments.
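The assessment described above can be illustrated with a minimal Python sketch. It assumes each trained algorithm has produced a binary call per sample and that true labels are available. The function names, and the rule of ranking algorithms by the sum of sensitivity and specificity, are hypothetical choices made for illustration and are not specified herein.

```python
from typing import Dict, List, Tuple


def sensitivity_specificity(truth: List[bool], calls: List[bool]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for paired true labels and calls."""
    pairs = list(zip(truth, calls))
    tp = sum(1 for t, c in pairs if t and c)
    fn = sum(1 for t, c in pairs if t and not c)
    tn = sum(1 for t, c in pairs if not t and not c)
    fp = sum(1 for t, c in pairs if not t and c)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


def pick_best(
    truth: List[bool], calls_by_algorithm: Dict[str, List[bool]]
) -> Tuple[str, Dict[str, Tuple[float, float]]]:
    """Score every algorithm and return the name of the top-ranked one.

    Assumption: ranking uses sensitivity + specificity; other rules
    (e.g., sensitivity alone) are equally possible.
    """
    scores = {
        name: sensitivity_specificity(truth, calls)
        for name, calls in calls_by_algorithm.items()
    }
    best = max(scores, key=lambda name: sum(scores[name]))
    return best, scores
```

For example, `pick_best(truth, {"A": calls_a, "B": calls_b})` returns the name of the higher-ranked algorithm together with the per-algorithm (sensitivity, specificity) pairs.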

In certain embodiments, simulated (or simulation) data can aid dataprocessing, for example, by training an algorithm or testing analgorithm. In some embodiments, simulated data includes hypotheticalvarious samplings of different groupings of sequence reads. Simulateddata may be based on what might be expected from a real population ormay be skewed to test an algorithm and/or to assign a correctclassification. Simulated data also is referred to herein as “virtual”data. Simulations can be performed by a computer program in certainembodiments. One possible step in using a simulated data set is toevaluate the confidence of identified results, e.g., how well a randomsampling matches or best represents the original data. One approach isto calculate a probability value (p-value), which estimates theprobability of a random sample having better score than the selectedsamples. In some embodiments, an empirical model may be assessed, inwhich it is assumed that at least one sample matches a reference sample(with or without resolved variations). In some embodiments, anotherdistribution, such as a Poisson distribution for example, can be used todefine the probability distribution.
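The probability value described above can be estimated empirically by repeated random sampling. The following Python sketch is one possible reading, under stated assumptions: `score` is a hypothetical user-supplied scoring function, "better" is taken to mean a strictly higher score, and sampling is without replacement. None of these choices is specified herein.

```python
import random
from typing import Callable, List, Sequence


def empirical_p_value(
    population: Sequence[float],
    selected_score: float,
    sample_size: int,
    score: Callable[[List[float]], float],
    n_draws: int = 1000,
    seed: int = 0,
) -> float:
    """Estimate the probability that a random sample scores better than
    the selected samples.

    Assumption: "better" means a strictly higher score.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    items = list(population)
    better = 0
    for _ in range(n_draws):
        draw = rng.sample(items, sample_size)  # sampling without replacement
        if score(draw) > selected_score:
            better += 1
    return better / n_draws
```

A small estimate suggests that random samplings rarely outperform the selected samples. A parametric alternative, such as the Poisson distribution mentioned above, would replace the sampling loop with a closed-form tail probability.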

A system may include one or more microprocessors in certain embodiments.A microprocessor can be connected to a communication bus. A computersystem may include a main memory, often random access memory (RAM), andcan also include a secondary memory. Memory in some embodimentscomprises a non-transitory computer-readable storage medium. Secondarymemory can include, for example, a hard disk drive and/or a removablestorage drive, representing a floppy disk drive, a magnetic tape drive,an optical disk drive, memory card and the like. A removable storagedrive often reads from and/or writes to a removable storage unit.Non-limiting examples of removable storage units include a floppy disk,magnetic tape, optical disk, and the like, which can be read by andwritten to by, for example, a removable storage drive. A removablestorage unit can include a computer-usable storage medium having storedtherein computer software and/or data.

A microprocessor may implement software in a system. In someembodiments, a microprocessor may be programmed to automatically performa task described herein that a user could perform. Accordingly, amicroprocessor, or algorithm conducted by such a microprocessor, canrequire little to no supervision or input from a user (e.g., softwaremay be programmed to implement a function automatically). In someembodiments, the complexity of a process is so large that a singleperson or group of persons could not perform the process in a timeframeshort enough for determining one or more characteristics of a sample.

In some embodiments, secondary memory may include other similar meansfor allowing computer programs or other instructions to be loaded into acomputer system. For example, a system can include a removable storageunit and an interface device. Non-limiting examples of such systemsinclude a program cartridge and cartridge interface (such as that foundin video game devices), a removable memory chip (such as an EPROM, orPROM) and associated socket, and other removable storage units andinterfaces that allow software and data to be transferred from theremovable storage unit to a computer system.

Provided herein, in certain embodiments, are systems, machines andapparatuses comprising one or more microprocessors and memory, whichmemory comprises instructions executable by the one or moremicroprocessors and which instructions executable by the one or moremicroprocessors are configured to generate an overhang profile ofnucleic acid overhangs of a population of nucleic acids in a sample,and, based on the overhang profile, determine one or morecharacteristics of the sample.

Provided herein, in certain embodiments, are systems, machines andapparatuses comprising one or more microprocessors and memory, whichmemory comprises instructions executable by the one or moremicroprocessors and which instructions executable by the one or moremicroprocessors are configured to analyze overhang informationassociated with overhang identification sequences that indicate presenceof an overhang for reverse sequence reads, thereby generating ananalysis, and omitting from the analysis overhang information associatedwith overhang identification sequences that indicate presence of anoverhang for forward sequence reads.

Provided herein, in certain embodiments, are machines comprising one ormore microprocessors and memory, which memory comprises instructionsexecutable by the one or more microprocessors and which memory comprisesoverhang data for nucleic acid overhangs of a population of nucleicacids in a sample, and which instructions executable by the one or moremicroprocessors are configured to generate an overhang profile of thenucleic acid overhangs, and, based on the overhang profile, determineone or more characteristics of the sample.

Provided herein, in certain embodiments, are machines comprising one ormore microprocessors and memory, which memory comprises instructionsexecutable by the one or more microprocessors and which memory comprisesforward sequence reads and reverse sequence reads generated by asequencing process, and which instructions executable by the one or moremicroprocessors are configured to analyze overhang informationassociated with overhang identification sequences that indicate presenceof an overhang for the reverse sequence reads, thereby generating ananalysis, and omitting from the analysis overhang information associatedwith overhang identification sequences that indicate presence of anoverhang for the forward sequence reads.

Provided herein, in certain embodiments, are non-transitorycomputer-readable storage media with an executable program storedthereon, where the program instructs a microprocessor to perform thefollowing: (a) access overhang data for nucleic acid overhangs of apopulation of nucleic acids in a sample, and (b) generate an overhangprofile of the nucleic acid overhangs, and (c) based on the overhangprofile, determine one or more characteristics of the sample.

Provided herein, in certain embodiments, are non-transitorycomputer-readable storage media with an executable program storedthereon, where the program instructs a microprocessor to perform thefollowing: (a) access forward sequence reads and reverse sequence readsgenerated by a sequencing process, and (b) analyze overhang informationassociated with overhang identification sequences that indicate presenceof an overhang for the reverse sequence reads, thereby generating ananalysis, and omitting from the analysis overhang information associatedwith overhang identification sequences that indicate presence of anoverhang for the forward sequence reads.
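The two program flows recited above (generating an overhang profile, and analyzing overhang information for reverse sequence reads while omitting forward sequence reads) can be sketched as follows. This is a minimal illustration only: the `ReadOverhang` record, its field names, the labels "5prime" and "3prime", and the choice of a profile expressed as fractions per (type, length) pair are all assumptions and are not defined herein.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class ReadOverhang:
    """Hypothetical record of overhang information derived from one read."""
    read_direction: str            # "forward" or "reverse"
    overhang_type: Optional[str]   # e.g., "5prime" or "3prime"; None if no
                                   # overhang identification sequence was found
    overhang_length: int


def reverse_read_overhangs(records: Iterable[ReadOverhang]) -> List[ReadOverhang]:
    """Keep overhang information from reverse reads only; forward reads
    are omitted from the analysis."""
    return [
        r for r in records
        if r.read_direction == "reverse" and r.overhang_type is not None
    ]


def overhang_profile(records: Iterable[ReadOverhang]) -> Dict[Tuple[str, int], float]:
    """Return the fraction of records for each (type, length) pair."""
    counts = Counter((r.overhang_type, r.overhang_length) for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: n / total for key, n in counts.items()}
```

A profile generated this way could then be compared across samples or against a reference profile to determine one or more characteristics of a sample.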

Kits

Provided in certain embodiments are kits. The kits may include anycomponents and compositions described herein (e.g., oligonucleotides,oligonucleotide components/regions, target nucleic acids, enzymes)useful for performing any of the methods described herein, in anysuitable combination. Kits may further include any reagents, buffers, orother components useful for carrying out any of the methods describedherein. For example, a kit may include one or more of a plurality orpool of oligonucleotide species, a kinase adapted to 5′ phosphorylatenucleic acids of a nucleic acid sample (e.g., a polynucleotide kinase(PNK)), a DNA ligase, a cleavage agent, an enzyme (e.g., polymerase)suitable for performing a fill-in and/or strand displacement reaction,and any combination thereof.

Components of a kit may be present in separate containers, or multiplecomponents may be present in a single container. Suitable containersinclude a single tube (e.g., vial), one or more wells of a plate (e.g.,a 96-well plate, a 384-well plate, and the like), and the like.

Kits may also comprise instructions for performing one or more methodsdescribed herein and/or a description of one or more componentsdescribed herein. For example, a kit may include instructions for usinga composition described herein to modify ends of nucleic acid fragmentsand/or to produce a nucleic acid library. Instructions and/ordescriptions may be in printed form and may be included in a kit insert.In some embodiments, instructions and/or descriptions are provided as anelectronic storage data file present on a suitable computer readablestorage medium, e.g., portable flash drive, DVD, CD-ROM, diskette, andthe like. A kit also may include a written description of an internetlocation that provides such instructions or descriptions.

EXAMPLES

The examples set forth below illustrate certain embodiments and do notlimit the technology.

Example 1: RNAse-Cleavable Hairpin Adapters

One form of damage in degraded DNA molecules (e.g., ancient DNA) is the deamination of cytosine (C) into uracil. DNA sequencing adapters in the form of a hairpin structure that contain a deoxyuridine in the loop require the use of a uracil-DNA-glycosylase and an endonuclease to cut the single strand. One potential consequence of using these enzymes is cleavage at damaged sites within DNA fragments of interest, rendering the fragments inaccessible to library conversion. Certain substrates similar to ancient DNA, such as, for example, circulating cell-free DNA, may accumulate damaged bases during the process of release and clearance from the body. This could limit the use of sequencing adapters containing deoxyuridine for circulating cell-free DNA library preparations. Use of hairpin adapters having RNA bases within the loop of the hairpin in conjunction with use of an RNAse during library preparation could obviate such challenges for DNA library preparation. Use of an RNAse during library preparation also can be useful for reducing certain contaminants prior to sequencing. Example hairpin adapter configurations are shown in FIG. 1A, FIG. 1B and FIG. 1C.

Methods

Sequencing libraries were prepared by ligating template DNA to a DNA/RNA adapter having a stem-loop structure, where the double-stranded stem was created by the complementarity of several DNA bases universal to Illumina sequencing adapters, and the single-stranded loop contained unique, non-complementary DNA bases of P5 and P7 (forward and reverse) primer sites. Three guanine RNA bases were positioned within the loop, which served as a cleavage site for T1 RNAse. Following cleavage, a double-stranded DNA molecule was formed that could be amplified and sequenced directly.

A commercially available Illumina library preparation kit that performs A-tailing during end repair was used for comparison. The RNAse-cleavable hairpin adapters tested in this Example were designed to have a single T-overhang to accommodate the A-tailed template and had a 5′ phosphate.

The performance of the RNAse-cleavable hairpin adapter was evaluated by replicating the preparation of standard Illumina libraries, and substituting the RNAse-cleavable hairpin adapter for the commercial kit's uracil adapter. Four sources of DNA were used as template: 1) cell-free DNA isolated from blood plasma, 2) cell-free DNA isolated from urine, 3) human formalin-fixed paraffin-embedded (FFPE) sample, and 4) DNA isolated from a 30 kya bison bone. Library prep was performed as described in the manual and paired-end sequencing was performed on a MiSeq using a v2 nano flow cell or v3 reagent kit.

Results

The results in Table 1 show that the RNAse-cleavable hairpin adapter performed as well as the commercially available uracil adapter, with the largest gains (in percentage of human-mapped reads) observed for the cell-free DNA from urine and the fragmented FFPE sample. In six of the eight experiments below, the RNAse-cleavable hairpin adapter showed a slightly lower duplication rate.

TABLE 1 Comparison of commercial kit adapter with RNA-cleavable adapter. Values in bold show the higher of the two comparisons.

type | sample | total reads (not phiX or discarded) | % human mapped of all reads that are not phiX or discarded | perc_dup
kit | urine1 | 822,307 | 32.49% | 0.0921%
RNA hairpin | urine1 | 449,756 | 41.94% | 0.0504%
kit | plasma1 | 3,166,878 | 91.27% | 0.0085%
RNA hairpin | plasma1 | 3,090,306 | 91.36% | 0.0107%
kit | plasma2 | 3,281,333 | 89.33% | 0.0364%
RNA hairpin | plasma2 | 3,148,270 | 89.63% | 0.0249%
kit | plasma3 | 3,301,742 | 91.47% | 0.0219%
RNA hairpin | plasma3 | 2,927,874 | 91.61% | 0.0134%
kit | plasma4 | 3,053,505 | 91.72% | 0.0235%
RNA hairpin | plasma4 | 3,462,353 | 91.76% | 0.0127%
kit | ffpe1-fragmented | 170,776 | 93.46% | 0.0000%
RNA hairpin | ffpe1-fragmented | 165,126 | 97.20% | 0.0006%
kit | ffpe1 | 30,566 | 83.12% | 0.0039%
RNA hairpin | ffpe1 | 20,300 | 83.46% | 0.0118%
kit | bison | 1,992,301 | 15.57% | 0.005%
RNA hairpin | bison | 1,467,088 | 15.00% | 0.005%

Example 2: Unique End Identifiers (UEIs) and Unique Molecular Identifiers (UMIs)

Events such as DNA release, various forms of cell-death, and postmortem cellular processes are characterized by distinct morphological features and molecular pathways. Signals found in the DNA termini following a double-strand break may reflect unique patterns of DNA degradation and may provide information about causative processes and potential pathological processes. To investigate this, RNAse-cleavable adapters were designed to capture single-stranded overhangs when present in native DNA termini (see e.g., FIG. 1B).

Such adapters were generated by ligating synthetic DNA that contains at least two parts: 1) a 5′ or 3′ single-stranded overhang of length N, and 2) a unique end identifier (UEI) adjacent to the overhang. A UEI is a double-stranded barcode that conveys the type and length of that overhang, if any. Generally, UEIs and UEI adapters are not phosphorylated to avoid hybridization and formation of dimers. In some instances, UEIs and UEI adapters are phosphorylated.
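
The information a UEI conveys can be pictured as a lookup from barcode to overhang type and length. The following Python sketch is illustrative only: the four-base barcodes, the `Overhang` record and the `decode_uei` helper are hypothetical and are not sequences or names taken from this disclosure.

```python
from typing import NamedTuple, Optional


class Overhang(NamedTuple):
    kind: str    # "5prime", "3prime", or "blunt"
    length: int  # overhang length in nucleotides (0 for blunt)


# Hypothetical UEI barcodes, invented for illustration only.
UEI_TABLE = {
    "ACGT": Overhang("blunt", 0),
    "TGCA": Overhang("5prime", 1),
    "GATC": Overhang("5prime", 2),
    "CTAG": Overhang("3prime", 1),
    "AGTC": Overhang("3prime", 2),
}


def decode_uei(barcode: str) -> Optional[Overhang]:
    """Return the overhang encoded by a UEI barcode, or None if unknown."""
    return UEI_TABLE.get(barcode.upper())
```

In this sketch an unrecognized barcode returns `None` rather than a best guess, so that reads with sequencing errors in the UEI are not silently assigned to the wrong overhang class.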

In certain iterations of this hairpin design, as well as designs of other adapters, a unique molecular identifier (UMI) also is included adjacent to the UEI or elsewhere within the adapter structure (see e.g., FIG. 1C). UMIs serve a different purpose than the UEIs in that they allow an estimation of the number of unique starting molecules and evaluate the sensitivity of the ligation reaction.
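
As a minimal sketch of how UMIs can support an estimate of the number of unique starting molecules, reads that share both a UMI and a mapped start position can be collapsed to a single molecule. The pairing of UMI with start position, the function names and the example reads are assumptions made for illustration; error-tolerant UMI matching is not attempted.

```python
def count_unique_molecules(reads):
    """Count distinct (UMI, mapped start) pairs among reads.

    reads: iterable of (umi, start) tuples. Both fields are assumed to have
    been extracted upstream (hypothetical input format).
    """
    return len({(umi.upper(), start) for umi, start in reads})


def duplication_fraction(reads):
    """Fraction of reads that repeat an already-seen (UMI, start) pair."""
    reads = list(reads)
    if not reads:
        return 0.0
    return 1.0 - count_unique_molecules(reads) / len(reads)


# Four reads, two of which carry the same UMI at the same position.
example_reads = [("AAC", 100), ("AAC", 100), ("GGT", 100), ("AAC", 250)]
unique = count_unique_molecules(example_reads)
```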

Example 3: Double-Sided Oligos with Unique End Identifiers (UEIs) for Tagging DNA Ends

Unique end identifiers (UEIs) also can be used for tagging DNA ends. In order to provide flexible options for other sequencing platforms or downstream analyses, UEIs can function as stand-alone ligation components (i.e., without sequencer-specific adapter sequences). This process encodes and keeps intact the native ends of overhanging DNA for conversion into any library type or analysis. The product of such ligation may be a double-stranded blunt-ended molecule.

One design is depicted in FIG. 2A, which allows ligation on either side of a UEI to a corresponding DNA overhang. This oligo, which may be referred to as a double-sided UEI oligo, has internal uracil(s) (or deoxyuridine) on forward and reverse strands which serve as cleavage sites for the Uracil-Specific Excision Reagent (USER) enzyme (i.e., a mixture of uracil DNA glycosylase (UDG) and the DNA glycosylase-lyase endonuclease VIII). Certain oligo designs include two to twelve random bases between UEIs, with one or two uracils. FIG. 2A shows example oligonucleotides, and an example workflow is illustrated in FIG. 2B. Briefly, after ligation, an enzyme cocktail cuts at the uracils on both strands, separating any unligated material or adjacent ligation product. After cleavage, a template DNA molecule remains with UEIs ligated on both sides. Fill-in is performed to repair the nicks remaining in the molecule. After fill-in, a double-stranded, blunt-ended molecule remains, ready for library preparation.

To test the performance of the above oligo design, double-sided UEI oligos were ligated (20× oligo:template ratio) to 5 ng of cell-free DNA isolated from plasma from two individuals. Often, two fractions of cell-free DNA fragment sizes isolated from plasma were observed. To address this, the DNA extract was fractionated to separate fragments larger than 500 bp (“high”) from those smaller than 500 bp (“low”). Library preparation was performed as described above: the template was phosphorylated, double-sided UEI oligos were ligated, uracils were cleaved, and fill-in was performed. The resulting DNA product was then subjected to library preparation using a NEB Ultra II library preparation kit for Illumina. Paired-end sequencing (2×150) was performed using a MiSeq v2 Nano flow cell.

Results

This approach generated more adapter dimers compared to other designs, which lowered the sequencing output. Between 30,000 and 150,000 reads per library were generated. DNA that was mappable to the human genome showed the expected size distribution of cell-free DNA, around 167 bp, or the length of DNA wrapped around a nucleosome (see FIG. 3 and FIG. 4). Very few reads from the “high” fractions of extract, APN 307 and APN 310, were short enough to merge, and the fragments were well above the size of one nucleosome (FIG. 3, panels C and F). The “high” fraction libraries had too few data for any visual display (FIG. 4, panels C and F). The “low” or “all” (i.e., not size fractionated) portions of the extracts generated libraries; however, most DNA fragments of intended size were ligated to only one double-sided UEI oligo or none of the double-sided UEI oligonucleotides (FIG. 4, panels A, B, D, E).

In certain instances, it is advantageous for each template DNA molecule to receive two UEI oligonucleotides. Often, molecules with only one ligated UEI oligo or no ligated UEI oligo will still be converted into a standard library molecule downstream. For purposes of characterizing the native ends of DNA, fragments having no UEI oligonucleotides are not useful, and fragments with one UEI oligo are not ideal. To address this challenge, biotinylated dNTPs are incorporated into the strand during the fill-in step. By immobilizing only those template fragments that have been successfully filled in (i.e., successfully ligated), DNA molecules with no UEI oligo ligation events are excluded from downstream processing. This approach is applicable to any design that prepares DNA fragments with UEI oligos (i.e., double-sided UEI oligos) before sequencing preparation.

Example 4: Blocked One-Sided Oligos with Unique End Identifiers (UEIs) for Tagging DNA Ends

Rather than using a design that encourages ligation on either end of a UEI oligo, a single blocking, modified base is placed on the 3′ end of a UEI oligo to ensure ligation in a specific direction. One design, shown in FIG. 5, blocks the 3′ end of the UEI oligos such that they are forced to ligate unidirectionally. An isodeoxy-base was selected as a blocker. Isodeoxy-G and isodeoxy-C have a hydrogen-bonding pattern that differs from that of natural bases. As such, they cannot bond with any natural base. Typically, isoG can only pair with isoC. By using only one of these two modified bases, either isoG or isoC, no ligation or hybridization should occur on the ‘incorrect’ end of the UEI oligo, forcing the correct orientation of the ligation event.

Because only the template DNA is phosphorylated (and not the oligo), a nick remains in the molecule on both the forward and reverse strand following ligation. Fill-in using a strand-displacing polymerase completes the double-stranded molecule, and removes the strand of the UEI oligo having the modified base (see FIG. 6).

Example 5: Dephosphorylation of Synthetic and Biological DNA

Library preparation described in certain Examples above omits a conventional end repair step, which generally chews back 3′ overhangs and fills in 5′ overhangs, and prepares the template for A-tailing or blunt-end ligation. The methods described above typically do not include use of a nuclease or polymerase to prepare template DNA, and instead phosphorylate the template with T4 PNK before ligation.

In certain instances, a pre-treatment was performed on oligos, including adapters and controls, and/or template DNA to remove all 3′ and 5′ terminal phosphates, recessed or otherwise. The phosphatase rSAP (1 unit/1 pmol of DNA) was used for the pre-treatment.

TapeStation data in FIG. 7A, FIG. 7B and FIG. 7C demonstrate the improvement of library generation after treating the cell-free DNA template (FIG. 7B) and cell-free DNA and adapters (FIG. 7C) with rSAP. Improvements following rSAP treatments are seen as a reduction of the adapter dimer peak and an increase of DNA-sized peaks. FIG. 7A shows a library made almost exclusively of adapter dimer, with a nearly complete loss of DNA, when the template is not treated with rSAP.

Significant improvements also were observed when synthetic 50 bp double-stranded control oligos were dephosphorylated before the first step with T4 PNK. This indicates that even when unphosphorylated oligos or adapters are purchased from a commercial supplier, some DNA termini are not amenable to end modification, including phosphorylation with T4 PNK, and thus perform poorly during ligation.

Example 6: Mate Pair Library

Genomic or other DNA that is longer than the recommended fragment length for library conversion is typically sheared, either mechanically or enzymatically, before library preparation. After shearing, conventional library preparation begins with an end repair step. Both shearing and end repair prevent access to, and thus observation of, the native DNA termini.

A portion of cell-free DNA fragments can be larger than is ideal for generating usable sequencing data (e.g., above 500 bp to 700 bp). The high molecular weight (HMW) portion of cell-free DNA containing large fragments may be the result of release following non-apoptotic cell death or necrosis, and the ends of such fragments may provide a source of useful information. One design for retaining and characterizing the native ends of long DNA fragments from HMW DNA, and successfully converting the preserved ends into a sequencing library, is a modification of a mate pair library or a circularization of DNA fragments.

The mate pair modification first includes ligating long, e.g., >500 bp, 5′ phosphorylated DNA fragments to a pool of biotinylated double-stranded (ds) oligonucleotides. Each ds oligo is designed to include a long palindromic single-stranded overhang on one side (can be placed as either a 5′ overhang or a 3′ overhang; shown in FIG. 8A as a 5′ overhang), and on the other side, all combinations of 5′ or 3′ single-stranded overhangs composed of a random sequence varied up to length N, preceded by a unique sequence that identifies the length and type (5′ or 3′) of the overhang, referred to as a unique end identifier (UEI). A few examples of oligonucleotides with varying 5′ and 3′ overhang lengths are illustrated in FIG. 8A. The long palindromic sequence generally is not found or predicted in the human genome, nor is it commonly observed in bacteria often associated with the human microbiome. High Delta G values estimated for hairpin formation suggest that self-dimers, rather than hairpins, will preferentially form, as desired, promoting circularization. Native 3′ and 5′ overhangs present on the dsDNA template, if any, will ligate to the available overhanging ends of the oligonucleotide set. Natively blunt dsDNA templates can ligate to the blunt-ended oligonucleotides in the set. Self-dimers produced by long palindromic sequences of complementary oligonucleotides permit circularization of the long fragments in the presence of ligase. In some instances, a polynucleotide kinase is used with or before the ligase to repair nicks and complete the double strand. In some instances, the oligonucleotides are prepared with a 5′ phosphate.
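
A palindromic overhang, in the nucleic acid sense, is one that equals its own reverse complement, which is what lets two copies anneal to each other as a self-dimer. The short Python sketch below checks that property. The function names and the test sequences are illustrative assumptions; the actual palindromic sequence used in the oligonucleotides is not reproduced here.

```python
# Complement map for unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_self_complementary(seq: str) -> bool:
    """True if seq equals its own reverse complement (a DNA palindrome)."""
    return seq.upper() == reverse_complement(seq)


# Illustrative sequences only (not taken from this disclosure).
examples = {s: is_self_complementary(s) for s in ("GAATTC", "ACGCGT", "GATTACA")}
```

A sequence of odd length can never pass this check, because its central base would have to be its own complement.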

After the oligonucleotides and template DNA are ligated and circularized, an exonuclease removes any non-circularized DNA or excess oligonucleotides. Circularized DNA is then sheared (e.g., using an ultrasonicator (Covaris) or by enzymatic fragmentation) to generate molecules suitable for short read high throughput sequencers. Generally, only the DNA ends of interest are captured by immobilizing all DNA fragments that are ligated to biotinylated oligonucleotides on streptavidin-coated beads. These steps reduce the generation of library molecules from pieces of high molecular weight (HMW) DNA that did not originate from the native ends. After fractionating the nucleic acid into low molecular weight (LMW) and high molecular weight (HMW) portions following DNA extraction, the circularization strategy for long DNA fragments can be paired with an approach, like those described above, for shorter DNA fragments. The datasets resulting from these two strategies can be bioinformatically merged or analyzed separately for comparative purposes to explore how native DNA ends differ depending on overall DNA fragment length.

The mate-pair or circularization method is generally performed as follows: 1) pre-treat template DNA with phosphatase; 2) phosphorylate template DNA; 3) ligate overhanging UEI oligonucleotide pool to template, repairing nicks if necessary; 4) treat with exonuclease; 5) shear; 6) immobilize biotinylated fragments on beads; 7) begin library preparation of choice.

For methods described above, if a particular overhanging end pattern is significant for biological (e.g., human endogenous or tissue-specific nuclease function), biomedical (e.g., pathological or treatment-induced cell death, tumor formation), or forensic (e.g., biological vs. taphonomic degradation) discovery, overhanging oligo/adapters with built-in UEI sequences can be used as a targeted enrichment strategy, with or without biotin.

Example 7: Additional Adapter Examples

Additional adapter examples are shown in FIGS. 9A and 9B (panel A). FIG. 9A shows an example method for attaching a unique end identifier (UEI) sequence in a first phase using strand-displacing polymerase, and a sequencer-specific sequence (e.g., sequencing adapter) in a second phase. Thus the adapters shown in FIG. 9A, panel A, do not contain sequencer-specific adapter sequences (e.g., P5, P7). Panel A of FIG. 9A shows a Y adapter (left) and a hairpin adapter (right) composed of a unique end identifier (UEI) sequence (shown in gray) and random sequence (shown in black). In some instances, the Y adapter is a cleaved version of the hairpin adapter. The hairpin adapter includes a cleavage site (“X”), which can include one or more RNA nucleotides as described in Example 1. Panel B of FIG. 9A shows ligation of the adapters to a target nucleic acid. The hairpin adapter ligation product can be cleaved at the cleavage site. After cleavage, the ligation product is the same as the Y-adapter ligation product. Panel C of FIG. 9A shows a fill-in step at nicks with a strand-displacing polymerase to create a fully complementary double-stranded, blunt-ended fragment. Panel D of FIG. 9A shows a nucleic acid fragment that is ready for any sequencing library preparation of choice (second phase).

FIG. 9B shows an example method for attaching a Y adapter or a hairpin adapter to the ends of a native nucleic acid fragment. Panel A shows a Y adapter (left) and a hairpin adapter (right) composed of an overhang, a unique end identifier (UEI) sequence (shown in gray), and priming sequences (priming sequence 1 (e.g., Illumina P5 priming sequence) and priming sequence 2 (e.g., Illumina P7 priming sequence); priming region shown in black). Panel B shows ligation of the adapters to a target nucleic acid. Because the adapters are not phosphorylated, the ligation only occurs at the 5′ end of the template, leaving nicks. Panel C shows that the nicks are repaired once the 5′ adapter strand is phosphorylated and ligates the 3′ end of the adapter. After nick repair, the hairpin adapter ligation product can be cleaved at the cleavage site. After cleavage, the ligation product is the same as the Y-adapter ligation product. This method generates a double-stranded nucleic acid fragment that is ready for any sequencing library preparation of choice (second phase) and/or sequencer of choice, which may depend on the priming sequences used.

Example 8: Fragment Size Selection

Rather than perform size selection after library preparation using beads, gel excision or automated methods like the Pippin Prep machine, a size fractionation is performed on the DNA extract, in some instances, controlling the size of DNA that will be converted into library molecules. There are certain practical and biological motives for fractionation. Practically, fractionation reduces the presence of fragments too long for efficient sequencing using certain sequencing platforms (e.g., Illumina platforms). Biologically, fractionation separates and retains fragments that could be the products of differing biological processes. Cell-free DNA (cfDNA) fragment lengths generally are around the size of one, two, or a few nucleosomes (~170, ~340, ~510 bp), and the fragments are generally considered a product of cell apoptosis. Larger fragments include genomic DNA (gDNA) contaminants, in certain instances. Larger fragments also may include cfDNA fragments (e.g., fragments larger than a nucleosome, for example fragments at different stages of breakdown, or fragments derived from a process other than apoptosis (e.g., necrosis)).
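
The nucleosome ladder mentioned above (roughly 170, 340 and 510 bp) suggests a simple way to label a fragment by the nearest whole number of nucleosomes. The Python sketch below does this under the assumption of a fixed 170 bp unit; the function name and the unit default are illustrative choices, and real fragment-length distributions are broader than a single multiple implies.

```python
def nearest_nucleosome_count(length_bp: int, unit_bp: int = 170) -> int:
    """Estimate how many nucleosome units a fragment length is closest to.

    unit_bp defaults to the approximate 170 bp spacing of the cfDNA ladder.
    Fragments shorter than half a unit are still reported as 1.
    """
    if length_bp <= 0:
        raise ValueError("length_bp must be positive")
    return max(1, round(length_bp / unit_bp))


ladder = [nearest_nucleosome_count(n) for n in (167, 340, 510)]
```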

Fractionation is performed using Solid Phase Reversible Immobilization (SPRI) beads to separate and retain both short and long DNA fragments, which are defined as <500 bp and >500 bp, respectively. Carboxylated beads are prepared in a solution of varying PEG-8000 (18%, 20% and 38%) and NaCl (0.5 M, 1 M, 2 M) concentrations. The DNA extract is incubated with the solution, and the beads are collected using a magnetic particle separator. The supernatant is either discarded or retained, depending on desired effect. After washing the beads with ethanol, DNA is released from beads into an elution buffer of neutral pH, like water, 10 mM Tris-HCl, TE buffer, or TE with TWEEN-20 (TET).

The length of DNA fragments that will immobilize on SPRI beads in the solution is dependent on the concentration of PEG, which translates to the ratio of beads to DNA. Generally, the lower the ratio, the longer the DNA fragments that will be captured on beads, resulting in shorter DNA remaining in the supernatant. To retain the short fragments from the supernatant, a higher ratio of beads to DNA is used. In one example, dual size selection was performed on plasma DNA extractions, and each DNA extract was incubated with a 0.4-0.5× ratio of SPRI bead solution (final 18% PEG, 1 M NaCl) for 15 minutes at room temperature (i.e., 20 μl of DNA, 8 μl of beads). After 15 minutes, the supernatant was collected and set aside. The beads were washed twice with 80% ethanol and eluted in 15 μl of TET buffer. This fraction generally excludes short DNA fragments.
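
The bead-to-sample ratio in the example above is a simple volume ratio: 8 μl of beads added to 20 μl of DNA is a 0.4× ratio. A minimal Python sketch of that arithmetic follows; the helper names are hypothetical, and the ratio is taken relative to the sample volume, which is an assumption about how the ratio is defined here.

```python
def bead_volume_ul(sample_ul: float, ratio: float) -> float:
    """Volume of bead solution (μl) for a given bead:sample ratio."""
    return sample_ul * ratio


def bead_ratio(sample_ul: float, beads_ul: float) -> float:
    """Bead:sample ratio implied by the two volumes."""
    return beads_ul / sample_ul


# Worked example from the text: 20 μl of DNA at a 0.4x ratio.
beads_needed = bead_volume_ul(20.0, 0.4)   # 8 μl
implied_ratio = bead_ratio(20.0, 8.0)      # 0.4
```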

A 2× ratio of SPRI beads was added to the supernatant. Incubation, washing and elution were performed as described above. After elution, this fraction generally contained short DNA fragments. The short fragments are then used in various Illumina overhanging library preparation methods, some of which are described in certain Examples above.

Example 9: Oligonucleotide Adapters with RNA Overhangs

This Example describes a method that uses RNA bases as the substrate for single-stranded overhangs in oligonucleotide adapters. An adapter with an RNA overhang can be structured for library preparation in numerous configurations, e.g., Y, hairpin, duplex, duplex with blocking modifications (see e.g., FIG. 10A). Like all other iterations described herein, a unique end identifier (UEI) indicating the length of overhang and type of overhang (e.g., 5′ overhang or 3′ overhang) is incorporated into the duplex portion of the oligonucleotide. In some instances, Illumina-specific adapter sequences (e.g., P5, P7) are included in the UEI-adapters. In some instances, Illumina-specific adapter sequences (e.g., P5, P7) are not included in the UEI-adapters.

Certain ligases or combinations of ligases, such as T4 RNA ligase 2 or SplintR® Ligase, can, under certain conditions, ligate RNA to DNA when DNA is annealed to an RNA template. By creating adapters with single-stranded overhangs of RNA bases, hybridization with native DNA template will result in an RNA-DNA duplex that can be ligated.

To address a potential problem of oligonucleotide adapters forming dimers (e.g., through overhang hybridization forming RNA-RNA duplexes), digestion of adapter dimers is performed using a ribonuclease (RNAse) that targets double-stranded RNA (dsRNA) structures. RNAse III is an example RNAse that targets dsRNA structures. Most ribonucleases require a long substrate to function well. However, shorter dsRNA adapter dimers are eliminated via digestion or cleavage provided the 5′ end (‘leader’) of the adapter design (e.g., minimum length of substrate and a permissive leader sequence) satisfies the canonical requirements for a particular ribonuclease (e.g., RNAse III).

An example workflow includes the following components: 1) dephosphorylate DNA template; 2) phosphorylate DNA template; 3) hybridize template with a plurality of (completely or partially, depending on design) double-stranded DNA adapter oligonucleotide species each having a UEI and a single-stranded overhang with random RNA bases of length 1 to N; a blunt adapter (no overhang) also is included; 4) ligate with one or a combination of ligases; 5) if necessary, cleave the hairpin structure; 6) complete the double-stranded molecule by nick seal or fill-in at the nick using strand-displacing polymerase, depending on adapter configuration; 7) SPRI purify to remove adapter dimers based on size, removing excess adapters and dimers under 100 bp; enzymatic digestion also is used in certain instances; 8) continue to Illumina preparation, if necessary.
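
To make the composition of the adapter pool in step 3 concrete, the Python sketch below enumerates one adapter species per overhang type (5′ or 3′) and length 1 to N, plus a single blunt species, and counts the random overhang sequences possible at each length. It assumes both overhang orientations are present in the pool and four possible bases per random position; the labels and function names are hypothetical.

```python
def adapter_species(max_overhang: int):
    """List (overhang type, length) for each adapter species in a pool.

    Assumes one species per orientation and length, plus one blunt adapter.
    """
    species = [("blunt", 0)]
    for kind in ("5prime", "3prime"):
        for length in range(1, max_overhang + 1):
            species.append((kind, length))
    return species


def random_overhang_variants(length: int, alphabet_size: int = 4) -> int:
    """Number of distinct random overhang sequences of a given length."""
    return alphabet_size ** length


pool = adapter_species(6)  # N = 6 chosen only as an example
```

Under these assumptions a pool with N = 6 contains 13 species, each of which needs its own UEI.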

Example 10: Oligonucleotide Adapters for High Molecular Weight (HMW) DNA

This Example describes oligonucleotide adapters and methods for collecting overhang information from high molecular weight (HMW) DNA utilizing short read next generation sequencing (NGS; e.g., high-throughput sequencing).

Certain oligonucleotide adapters and methods described in the Examples above may be useful for obtaining information on the length and orientation of overhangs for double stranded DNA (dsDNA) that has a length of less than 500 bp (e.g., using short read NGS sequencers). Certain strategies discussed above rely on ligating a pool of barcoded Y (or hairpin) oligonucleotide adapters, containing random single-stranded N-mers at either the 5′ terminus or the 3′ terminus, onto dsDNA. After next generation sequencing, a unique barcode present on each type of adapter relays the correct length and orientation of the overhang, if any, present on each DNA molecule. In certain protocols, one computational approach uses the barcode at the beginning of read 2 of sequencing data (e.g., obtained using the Illumina platform), due to the specifics of the molecular biology involved.
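
The read 2 approach described above amounts to reading a fixed-length prefix of each read 2 sequence and looking it up in a barcode table. The Python sketch below tallies overhang calls this way. The barcode length, the barcode sequences and the `tally_overhangs` name are all hypothetical, and exact matching is used for simplicity.

```python
from collections import Counter

UEI_LEN = 4  # hypothetical barcode length

# Hypothetical barcodes mapped to (overhang type, overhang length).
READ2_UEIS = {
    "ACGT": ("blunt", 0),
    "TGCA": ("5prime", 1),
    "CTAG": ("3prime", 1),
}


def tally_overhangs(read2_seqs):
    """Count overhang calls from the barcode at the start of each read 2."""
    counts = Counter()
    for seq in read2_seqs:
        call = READ2_UEIS.get(seq[:UEI_LEN].upper())
        counts[call if call is not None else ("unassigned", 0)] += 1
    return counts


calls = tally_overhangs(["ACGTTTGA", "TGCAGGGA", "tgcaCCCA", "NNNNAAAA"])
```

Reads whose prefix is not in the table are counted as unassigned rather than dropped, so the fraction of unusable reads stays visible.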

Generally, in the methods described above, dsDNA ends are unaltered. Use of short-read sequencers generally requires that high molecular weight (HMW) DNA (e.g., DNA greater than 500 bp in length) be sheared to smaller fragment sizes prior to sequencing. Shearing of HMW DNA typically results in native ends being lost. Provided below are oligonucleotide adapters and methods for sequencing DNA of any size, whether naked or bound to chromatin, on short read sequencers while retaining information on the overhangs of the original DNA molecule. Such adapters and methods may be useful for high throughput methods for analyzing overhang information of DNA naturally larger than about 500 bp, for example, including but not limited to DNA from formalin-fixed, paraffin-embedded tissue (FFPE DNA), DNA damaged by in vivo and/or in vitro endogenous means (UV, methylation, bulky adducts, and the like), and DNA from cell culture extracts. Other uses may include interrogating medically designed DNA damaging and chemotherapeutic agents, in cell culture in vitro or in vivo; a replacement for the current TUNEL assay; screening of novel nucleases; and the like.

A first method is shown in FIG. 11. DNA molecules of their native fragment length go through a partial sequencing library prep where a non-phosphorylated degenerate barcoded first adapter (e.g., P7 adapter) with overhang length information is ligated to phosphorylated genomic DNA (gDNA). The first adapter (e.g., P7 adapter) may be modified using any suitable modification to discourage adapter dimers from occurring and to prevent adapter chaining (see e.g., FIG. 12).

After an appropriate solid phase reversible immobilization (SPRI) clean-up to remove un-ligated adapters, ligated adapters go through phosphorylation and nick repair as described in certain Examples above. If a partial first adapter (e.g., P7 adapter) strategy is used, the adapter is filled in (e.g., using Bst DNA polymerase). After an appropriate SPRI clean-up to remove adapter dimers, DNA is sheared to an appropriate sequencing length for short read sequencers using mechanical or enzymatic methods. DNA molecules then undergo end repair and A-tailing using suitable end repair and A-tailing techniques. After A-tailing, a modified second adapter (e.g., modified P5 adapter), with a 5′ phosphorylation modification on the correct end and ligation blocking modifications on the other free ends, is ligated to the remaining DNA fragments.

The library is then PCR amplified using primers designed according to the first and second adapters (e.g., a P5/P7 amplification strategy). This strategy ensures enrichment for the native DNA overhangs since only molecules with a modified first adapter (e.g., P7 adapter) are amplified in the final pool. This strategy with the modifications only on the first adapter (e.g., P7 adapter) also ensures enrichment for the correct ligation event to be read on read 2 of an Illumina sequencer as per the specifics in the assay.

A second method is shown in FIG. 14. This method of capturing ends of high molecular weight DNA molecules may be performed on naked DNA or on chromatin bound DNA. First, a pool of overhanging Y-adapters is ligated to free dsDNA ends. The overhanging Y-adapters may be blocked (e.g., using C3 spacers; blocked modification is indicated by Xs in FIG. 14). DNA is then sheared (in case of chromatin bound DNA, the DNA is treated with proteinase K before shearing). After DNA is fragmented to a size suitable for sequencing, the library is completed by performing an end repair step (with or without A-tailing) and ligating a specialized adapter (e.g., specialized P5 adapter; referred to as special shorty P5* in FIG. 14) to the newly free ends, in order to enrich for the correctly formed molecule. DNA that is shorter than is amenable to shearing will still make a library molecule, but will have normal (i.e., not specialized) adapters and corresponding barcodes on both sides. The cleaved products may undergo an end repair process that performs blunt-end repair and A-tailing. For this step a commercially available enzyme mix may be used. Such enzyme mix may include a polynucleotide kinase that performs 5′ phosphorylation, a polymerase that performs 5′ fill-in (e.g., T4 polymerase), an enzyme with 3′ to 5′ exonuclease activity (e.g., T4 polymerase), and a polymerase that performs A-tailing (e.g., Taq polymerase).

The specialized adapter (e.g., specialized P5 adapter) is designed such that the complement to the long strand (e.g., P5) is long enough to stay annealed, but is too short to be amplified during index PCR and thus only one strand will be properly formed and copied. Information from an overhang is considered if it is on the P7 side. Accordingly, that strand is enriched. The specialized adapter (e.g., specialized P5 adapter) has a unique 8 bp barcode to recognize the molecules that were once HMW. The specialized adapter may be blocked (indicated by Xs in FIG. 14), minimizing interactions in the wrong direction. In certain instances, the specialized adapters are phosphorylated and blocked with C3 spacers. One of the two strands may have a phosphorothioate backbone modification before the T-overhang (see FIG. 14; phosphorothioate backbone modification is indicated by an asterisk “*” in the pool of specialized P5 adapters).

An example specialized P5 adapter includes the following nucleotide sequences:

(SEQ ID NO: 1) 5′/5Phos/GGGTAGCAAGATCGGAA/3SpC3/3′
(SEQ ID NO: 2) 5′/5SpC3/ACACTCTTTCCCTACACGACGCTCTTCCGATCTTGCTACC C*T 3′
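
The relationship between the two strands can be checked computationally: the short strand (SEQ ID NO: 1) should be the reverse complement of the 3′ region of the long strand (SEQ ID NO: 2), leaving a single T overhang. The Python sketch below performs that check on the base sequences only, with the 5Phos, SpC3 and phosphorothioate annotations and the space before the final bases stripped. The helper names are illustrative, and the annealing model (exact reverse-complement match) is a simplifying assumption.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def overhang_after_annealing(long_strand: str, short_strand: str):
    """Return the 3' bases of long_strand left unpaired by short_strand.

    Returns None if the short strand's reverse complement is not found.
    """
    target = reverse_complement(short_strand)
    idx = long_strand.upper().rfind(target)
    if idx == -1:
        return None
    return long_strand[idx + len(target):]


# Base sequences only; modification codes from the listing are omitted.
SHORT = "GGGTAGCAAGATCGGAA"                           # SEQ ID NO: 1
LONG = "ACACTCTTTCCCTACACGACGCTCTTCCGATCTTGCTACCCT"  # SEQ ID NO: 2

overhang = overhang_after_annealing(LONG, SHORT)
```

With these inputs the unpaired 3′ remainder is a single T, consistent with the T-overhang described for ligation to A-tailed fragments.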

Preliminary testing of this method was performed and the results indicated the method successfully captured the ends of high molecular weight DNA molecules.

Example 11: NGS Library Preparation Method to Characterize Native Termini of Fragmented DNA

In this Example, a ligation-based next-generation sequencing (NGS) library approach is described that provides comprehensive information about the native state of fragmentary DNA termini. By omitting the standard DNA end repair step, libraries generated using this method can encode the type of break at each molecule terminus using custom sequencing adapters. The end result of this library preparation method is a high-throughput NGS assay that provides a genome-wide, nucleotide-resolution view of DNA fragmentation. This method for generating Illumina-compatible double-stranded DNA (dsDNA) sequencing libraries introduces into the sequencing adapter a unique identifier that encodes the type (3′, 5′, or blunt) and length of single-stranded overhangs, if any, present on each original template molecule, as well as length and sequence of the remaining overhang, if present, on each DNA fragment. The accuracy of this method is demonstrated using 1) a population of control oligos with known single-stranded overhangs; and 2) DNA digestion products from specific restriction enzymes. The distribution of native termini of dsDNA fragments produced by common methods of mechanical and enzymatic shearing using the Diagenode Bioruptor, NEB Fragmentase, DNase I, and Micrococcal nuclease also is described. Finally, using this method, it is shown that common procedures for collecting human blood vary in their ability to protect circulating cell-free DNA fragments from degradation by nucleases present in the blood.

Materials and Methods

Nucleic Acid Template Acquisition and Preparation

Synthetic control oligos (Table 2) were designed using a random sequence generator at 50% GC content; sequences matching any known organism in public databases were removed. Each control molecule (n=12) is a unique 50 bp sequence of double-stranded DNA with one blunt end, and one 3′ or 5′ single-stranded overhang of random sequence, 1 to 6 nucleotides in length. Because each control is a unique sequence, it serves as its own barcode indicating the structure of the oligo. Oligos were synthesized using standard desalting purification and duplexed by Integrated DNA Technologies (IDT); all random nucleotides were ‘hand-mixed’ to reduce synthesis bias. Control oligos were pooled together in an equimolar ratio. Before adapter ligation, up to 1 pmol of pooled control oligos were dephosphorylated in a 20 μl reaction using rapid Shrimp Alkaline Phosphatase (New England Biolabs) incubated at 37° C. for 30 minutes followed by a 10-minute heat inactivation at 65° C. Control oligos were then 5′ phosphorylated by bringing the heat-inactivated 20 μl Shrimp Alkaline Phosphatase reaction up to 40 μl using T4 Polynucleotide Kinase (New England Biolabs), supplemented with ATP. The phosphorylation reaction was carried out at 37° C. for 30 minutes followed by a 30-minute heat inactivation step at 65° C. Oligos were then ready for adapter ligation. Oligo concentration was calculated as the original input (in pmol) divided by 40 μl.
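
The concentration calculation described above is a single division: input amount over the 40 μl final reaction volume. Because 1 pmol/μl equals 1 μM, the maximum 1 pmol input corresponds to 0.025 μM, or 25 nM. The Python sketch below expresses this; the function name and the choice to report nM are illustrative.

```python
def oligo_concentration_nM(input_pmol: float, final_volume_ul: float = 40.0) -> float:
    """Oligo concentration in nM from input amount and final volume.

    pmol per μl is numerically equal to μM, so multiply by 1000 for nM.
    """
    if final_volume_ul <= 0:
        raise ValueError("final_volume_ul must be positive")
    return input_pmol * 1000.0 / final_volume_ul


max_input_conc = oligo_concentration_nM(1.0)  # 1 pmol in 40 μl
```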

TABLE 2 Synthetic oligo design
Overhang type | Sequence 1 | SEQ ID NO: | Sequence 2 | SEQ ID NO:
3′ 1 bp | CCATACTGTGGTCGTCACCTATTACCCCGCGTAAAGGTAGGCTATGTCATN₁ | 3 | ATGACATAGCCTACCTTTACGCGGGGTAATAGGTGACGACCACAGTATGG | 15
3′ 2 bp | GTGAATTGTTGATGTCCTGGGTGCCTCGTCCCAAAAGCTGTCCTCACGACN₂ | 4 | GTCGTGAGGACAGCTTTTGGGACGAGGCACCCAGGACATCAACAATTCAC | 16
3′ 3 bp | GCTTCTCGAACCCGCGATCCGGCCGATCCGGCATAATGGGTTGATTTAGAN₃ | 5 | TCTAAATCAACCCATTATGCCGGATCGGCCGGATCGCGGGTTCGAGAAGC | 17
3′ 4 bp | CGACACGGATATTCCATCAAGAGACGGGCCTATGGTCCCTGTGATGATGTN₄ | 6 | ACATCATCACAGGGACCATAGGCCCGTCTCTTGATGGAATATCCGTGTCG | 18
3′ 5 bp | ACCTTGTGTGTTGCTGAAGCAAAGCCGCGTGACCGTTTTAACCAGCGAACN₅ | 7 | GTTCGCTGGTTAAAACGGTCACGCGGCTTTGCTTCAGCAACACACAAGGT | 19
3′ 6 bp | ATTTTACCACGAGTTCCTTACGACGGCTGTGATGCCACGGTAGGCAGGTAN₆ | 8 | TACCTGCCTACCGTGGCATCACAGCCGTCGTAAGGAACTCGTGGTAAAAT | 20
5′ 1 bp | N₁CGCTTTACGGGTCCTGGGCCGGGGTGCGATACCTTGCAGAAATCGAGGCC | 9 | GGCCTCGATTTCTGCAAGGTATCGCACCCCGGCCCAGGACCCGTAAAGCG | 21
5′ 2 bp | N₂AGGACTCTGCCGTCGACGAGTTCGTTAATTCACGGCATCACGTGCGTAGT | 10 | ACTACGCACGTGATGCCGTGAATTAACGAACTCGTCGACGGCAGAGTCCT | 22
5′ 3 bp | N₃ACCTCCGTCGCGCTATGTTCTGTTGCATTCGACCTTCTCCGTTCTGTGGG | 11 | CCCACAGAACGGAGAAGGTCGAATGCAACAGAACATAGCGCGACGGAGGT | 23
5′ 4 bp | N₄ACAAGAGGAGCATCCGTATTACCGCCTATATCGCCTACGTTTAGAGCATT | 12 | AATGCTCTAAACGTAGGCGATATAGGCGGTAATACGGATGCTCCTCTTGT | 24
5′ 5 bp | N₅GTAAATCCCACACAGCTGTCGGCTTATATGGTCATTGGACGGCGTAATAG | 13 | CTATTACGCCGTCCAATGACCATATAAGCCGACAGCTGTGTGGGATTTAC | 25
5′ 6 bp | N₆CCAGACAGCCATAGAGGTTACAAGCATAGCAATTTGCATCAGTTCGCAGA | 14 | TCTGCGAACTGATGCAAATTGCTATGCTTGTAACCTCTATGGCTGTCTGG | 26

NA12878 gDNA, purchased from the Coriell Institute for Medical Research, was prepared for adapter ligation in several ways. Mechanical shearing: NA12878 was sheared to an average length of 350 bp using a Bioruptor Pico (Diagenode) following manufacturer's instructions. Sheared DNA was then size selected from 200-600 bp using a Pippin Prep dye free 2% gel (Sage Science) following manufacturer's instructions. Restriction enzyme digest: 1 μg of NA12878 was digested in a 50 μl reaction using 10 units of MluCI (New England Biolabs) at 37° C. for 1 hour. Digested DNA was purified using 2× AMPURE beads (Beckman Coulter) following manufacturer's instructions. After purification, DNA was size-selected from 200-600 bp using a Pippin Prep dye free 2% gel (Sage Science) following manufacturer's instructions. Enzymatic shearing: 1 μg of NA12878 was digested in a 20 μl reaction with NEBNext® dsDNA Fragmentase® (New England Biolabs) at 37° C. for 25 minutes and stopped with 0.1 mM EDTA. The reaction was then brought up to 50 μl and purified as above. DNase I: 1 μg of NA12878 was digested in a 50 μl reaction using 0.01 units of DNase I (New England Biolabs) at 37° C. for 10 minutes and stopped with 0.1 mM EDTA; DNA was purified as above. Micrococcal nuclease: 1 μg of NA12878 was digested in a 50 μl reaction using 2 units of Micrococcal nuclease (New England Biolabs) at 37° C. for 5 minutes and stopped with 0.1 mM EDTA; DNA was purified as above.

All NA12878 reactions: After NA12878 gDNA was prepared using any of the above methods, it was end prepared for adapter ligation by dephosphorylation followed by 5′ phosphorylation using the same protocol detailed above for the control oligos.

For human plasma and cell-free DNA preparation, whole blood from deidentified donors was obtained for in-vitro investigational use from the Stanford Blood Center in Palo Alto, CA. Blood was drawn into one of several tube types (Table 3). Blood plasma was extracted from whole blood by spinning the blood collection tubes at 1800 g for 10 minutes at 4° C. Without disturbing the cell layer, the supernatant was transferred to microfuge tubes under sterile conditions in 2 ml aliquots and spun again at 16000 g for 10 minutes at 4° C. to remove cell debris, and stored at −80° C. as 1 ml aliquots. cfDNA was extracted from 1 ml plasma using the Circulating Cell-free DNA kit (Qiagen) following manufacturer's protocol. Purified cfDNA was measured for double-stranded DNA (dsDNA) concentration using the QUANT-IT high sensitivity dsDNA Assay Kit and a Qubit Fluorometer (ThermoFisher). Purified cfDNA was analyzed for size distribution using the Agilent TapeStation 4200 and associated D1000 and D5000 high sensitivity products. Cell-free DNA was end prepared for adapter ligation by dephosphorylation followed by 5′ phosphorylation using the same protocol detailed above for the control oligos.

TABLE 3 Blood collection tubes used in synthetic spike experiments
Blood Collection Tube | Anti-coagulant | Nuclease inhibited
Red top tube | None | No - Additional nucleases released during clotting
Yellow top tube | Sodium Citrate | No - Citrate has no nuclease inhibition function
Purple top tube | Potassium EDTA | Maybe - EDTA can inhibit nuclease function
Streck DNA tube | Potassium EDTA | Yes - contains nuclease and cell lysis inhibitors

For the control-spike experiments, approximately 40 ml of whole blood was obtained per donor in five blood collection tubes (Table 3). Blood from each tube was divided into three aliquots. To accurately evaluate the effect of blood nucleases on overhang profile, a pool of control oligos (1 pmol total per ml of whole blood) was added under sterile conditions. In the case of serum tubes, because coagulation initiates from the time of blood draw, the clot was separated at the start of the experiment and the control oligo pool was added to 1 ml of the supernatant prior to serum preparation. The plasma-oligo mixtures were incubated for 0, 4, or 24 hours. Immediately following each time point, plasma extraction and cfDNA preparation were performed following the protocol described above. Water and 1× PBS pH 7.4 were used as negative controls, substituting for control oligos; DNA extractions were performed similarly to those for the whole blood aliquots. The bead binding buffer, proteinase K and magnetic bead volumes were scaled according to the input plasma volume. DNA end preparation of control-spiked cfDNA was performed as described above, followed by library preparation.

Adapter Ligation and Sequencing Library Preparation

Each adapter contains Illumina sequencer-specific priming sites and a Unique-End-Identifier (UEI), a barcode sequence that indicates the length and identity (5′ or 3′) of the overhang, if any, present in the original molecule (Table 4). The adapters were synthesized using standard desalting purification and duplexed by Integrated DNA Technologies (IDT). For purposes of this study, the 13-adapter set includes six with 3′ overhangs (1 to 6 nt in length), six with 5′ overhangs (1 to 6 nt in length), and a single blunt adapter (i.e., no overhang). Adapters were not phosphorylated and thus were discouraged from forming dimers. All 13 duplexed adapters were pooled in equimolar ratio and prepared for ligation by end dephosphorylation using the following 20 μl reaction: 1 pmol of pooled adapters, 10 units of rapid Shrimp Alkaline Phosphatase (New England Biolabs), 1× Cutsmart Buffer, incubated at 37° C. for 30 minutes followed by a 10-minute heat inactivation at 65° C. Multiple dephosphorylation reactions were combined over a single QIAQUICK Nucleotide Removal column (Qiagen) and purified according to manufacturer's instructions. Adapter molarity was calculated using DNA concentration (Qubit Fluorometric Quantitation) and known length. Adapters were then ready for ligation.

TABLE 4 Adapter design
Adapter | Sequence 1 | SEQ ID NO: | Sequence 2 | SEQ ID NO:
Blunt | /5SpC3/ACACTCTTTCCCTACACGACGCTCTTCCGATCTataccgc | 27 | gcggtatAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC/3SPC3/ | 40
3′ 1 bp | /5SpC3/ACACTCTTTCCCTACACGACGCTCTTCCGATCTgatatcg*N₁ | 28 | cgatatcAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC/3SPC3/ | 41
3′ 2 bp | /5SpC3/ACACTCTTTCCCTACACGACGCTCTTCCGATCTgtctgacN₁*N₁ | 29 | gtcagacAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC/3SPC3/ | 42
3′ 3 bp | /5SpC3/ACACTCTTTCCCTACACGACGCTCTTCCGATCTgagccaaN₂*N₁ | 30 | ttggctcAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC/3SPC3/ | 43
3′ 4 bp | /5SpC3/ACACTCTTTCCCTACACGACGCTCTTCCGATCTcgccataN₃*N₁ | 31 | tatggcgAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC/3SPC3/ | 44
3′ 5 bp | /5SpC3/ACACTCTTTCCCTACACGACGCTCTTCCGATCTcgtatatN₄*N₁ | 32 | atatacgAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC/3SPC3/ | 45
3′ 6 bp | /5SpC3/ACACTCTTTCCCTACACGACGCTCTTCCGATCTgactaagN₅*N₁ | 33 | cttagtcAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC/3SPC3/ | 46
5′ 1 bp | /5SpC3/ACACTCTTTCCCTACACGACGCTCTTCCGATCTagtacgg | 34 | N₁*ccgtactAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC/3SPC3/ | 47
5′ 2 bp | /5SpC3/ACACTCTTTCCCTACACGACGCTCTTCCGATCTagcagcg | 35 | N₁*N₁cgctgctAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC/3SPC3/ | 48
5′ 3 bp | /5SpC3/ACACTCTTTCCCTACACGACGCTCTTCCGATCTccatatg | 36 | N₁*N₂catatggAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC/3SPC3/ | 49
5′ 4 bp | /5SpC3/ACACTCTTTCCCTACACGACGCTCTTCCGATCTagcctgg | 37 | N₁*N₃ccaggctAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC/3SPC3/ | 50
5′ 5 bp | /5SpC3/ACACTCTTTCCCTACACGACGCTCTTCCGATCTatacgcg | 38 | N₁*N₄cgcgtatAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC/3SPC3/ | 51
5′ 6 bp | /5SpC3/ACACTCTTTCCCTACACGACGCTCTTCCGATCTgctaggc | 39 | N₁*N₅gcctagcAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC/3SPC3/ | 52

Adapter ligation included an initial ligation step followed by a subsequent nick repair ligation step prior to index PCR. 0.05 pmol of substrate DNA (control/NA12878/cfDNA) was combined with 1 pmol of adapters in a 60 μl ligation reaction with 800 units of T4 DNA ligase (New England Biolabs) and incubated at 20° C. for 1 hour, followed by either a 2× AMPURE clean for control oligos, or a 1.2× AMPURE clean for NA12878 or cfDNA. After DNA purification, DNA was phosphorylated with 20 units of T4 Polynucleotide Kinase (New England Biolabs) and 1× T4 DNA ligase buffer in a 48.8 μl reaction and incubated at 37° C. After 30 minutes, 480 units of T4 DNA ligase was added to the reaction and the temperature reduced to 20° C. for 15 minutes. Nick repair was followed by a 2× AMPURE bead clean and elution in 20 μl of low TE (10 mM Tris pH 8, 0.1 mM EDTA).

For index PCR, 10 μl of purified adapter-ligated DNA was combined with 1× Kapa HiFi HotStart ReadyMix (Roche) and 0.4 mM final concentration of IS4 and 0.4 mM final concentration of an index primer2 in a 50 μl reaction and amplified using the following thermal cycling conditions: 3 minutes at 98° C. for initial denaturation followed by 15 cycles for control/NA12878 or 18 cycles for cfDNA at 98° C. for 20 seconds, 68° C. for 30 seconds, 72° C. for 30 seconds, and finally an elongation step of 1 minute at 72° C. After index PCR, DNA was purified with either a 1.5× AMPURE clean for control oligos, or a 1.2× AMPURE clean for NA12878/cfDNA. For each sequencing DNA library, final molarity estimates were calculated using fragment length distribution and dsDNA concentration (Agilent TapeStation 4200 and Qubit Fluorometric Quantitation unit). Samples were then pooled and run 2×150 bp cycles on an Illumina MISEQ benchtop sequencer (following manufacturer's instructions) to a depth of approximately 100,000 read-pairs per sample.
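The molarity estimate mentioned above can be illustrated with a minimal sketch. The text does not give the formula that was used; the version below assumes the conventional approximation of about 660 g/mol per base pair of dsDNA, and the function name is hypothetical.

```python
# Assumption (not stated in the text): average mass of one dsDNA base pair
# is taken as roughly 660 g/mol, a commonly used approximation.
AVG_BP_MASS_G_PER_MOL = 660.0


def dsdna_molarity_nm(conc_ng_per_ul: float, mean_length_bp: float) -> float:
    """Convert a dsDNA mass concentration (ng/ul) to nanomolar (nM).

    1 ng/ul equals 1e-3 g/L; dividing by the molar mass (g/mol) gives mol/L,
    and multiplying by 1e9 gives nM, hence the overall factor of 1e6.
    """
    if mean_length_bp <= 0:
        raise ValueError("mean_length_bp must be positive")
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS_G_PER_MOL * mean_length_bp)


# Hypothetical inputs: a 2 ng/ul library with a 350 bp mean fragment length.
print(round(dsdna_molarity_nm(2.0, 350), 3))
```

The mean fragment length would come from the fragment length distribution and the mass concentration from the fluorometric measurement.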

Informatic Analysis

Mapping UEI-barcoded read pairs poses a bioinformatic challenge when template molecules are shorter than the sum of the lengths of the forward and reverse reads plus a single 7-nt barcode. This challenge exists because each read can extend through its mate's barcode sequence and possibly beyond into the Illumina adapter sequence. One approach in studies where short template molecules are expected, such as in the field of ancient DNA, is to simultaneously remove adapter sequences and merge reads. This process includes collapsing forward and reverse reads into single sequences, based on sequence similarity, while trimming ends of reads that match known Illumina adapter sequences using SEQPREP (github.com/jstjohn/SeqPrep). When UEIs are present, however, these merged reads can have a 7-nt UEI on both ends, one of which will be reverse-complemented.

To simplify mapping, adapter trimming and read merging were conditioned on the presence of UEI sequences. For each read pair, the presence of a known UEI at the start of both the forward and reverse read was checked. UEIs were allowed to contain up to one “N” base, but no other base mismatches were allowed. If both reads had a known UEI sequence, whether the reads merged was checked by searching each sequence for the reverse complement of its mate's UEI. If neither read met this criterion, both reads were output unchanged, since a read can only include adapter sequence if it extends through its mate's UEI sequence. If both reads contained their mate's reverse-complemented UEI sequence, and the positions at which the mates' UEIs were encountered matched, then both reads were truncated at the position where their mates' UEIs were encountered. If the positions did not match, both reads were discarded.
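The conditional trimming rule can be sketched as follows. This is an illustration under stated assumptions, not the code used in the Example: function and parameter names are hypothetical, the tolerance for one “N” in a UEI is omitted for brevity, and two cases the text does not specify (a pair lacking a known UEI, or only one read containing its mate's UEI) are assumed to be returned unchanged.

```python
from typing import Optional, Set, Tuple

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def trim_pair(fwd: str, rev: str, known_ueis: Set[str],
              uei_len: int = 7) -> Optional[Tuple[str, str]]:
    """Return the (possibly truncated) pair, or None if the pair is discarded."""
    fwd_uei, rev_uei = fwd[:uei_len], rev[:uei_len]
    if fwd_uei not in known_ueis or rev_uei not in known_ueis:
        return fwd, rev  # assumption: pairs without known UEIs are left as-is
    # Search each read, past its own UEI, for its mate's reverse-complemented UEI.
    pos_fwd = fwd.find(revcomp(rev_uei), uei_len)
    pos_rev = rev.find(revcomp(fwd_uei), uei_len)
    if pos_fwd == -1 or pos_rev == -1:
        # Neither read ran through its mate's UEI (the one-read-only case is
        # an assumption): output unchanged.
        return fwd, rev
    if pos_fwd != pos_rev:
        return None  # inconsistent positions: discard both reads
    return fwd[:pos_fwd], rev[:pos_rev]
```

Truncating at the mate's UEI removes both the mate's barcode and any downstream adapter sequence, so the truncated reads keep only their own UEI followed by template sequence.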

Rather than storing all merged read pairs as collapsed sequences, they were kept as truncated read pairs, so that UEI sequences of mates would not interfere with mapping to reference genomes. For the sake of control oligo experiments, in which relatively short sequences were expected, collapsed sequences for read pairs that merged using the above criteria also were stored. For such sequences, the bases within the merged region were allowed to contain at most one mismatch (the chosen base at mismatching positions was the base with the higher quality, or a random base in the case of a tie).
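A minimal sketch of the collapsing step, under assumptions: the inputs are the template portions of a truncated pair with their UEIs already removed, quality scores are given as integer lists, and a tie is resolved by picking randomly between the two observed bases (the text says only "a random base"). All names are hypothetical.

```python
import random
from typing import List, Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def collapse_pair(fwd_insert: str, fwd_quals: List[int],
                  rev_insert: str, rev_quals: List[int],
                  max_mismatches: int = 1,
                  rng: Optional[random.Random] = None) -> Optional[str]:
    """Collapse a truncated pair into one sequence, or return None if rejected."""
    rng = rng or random.Random()
    rev_rc = revcomp(rev_insert)   # bring the reverse read onto the forward strand
    rev_q = rev_quals[::-1]        # qualities must be reversed along with the bases
    if len(fwd_insert) != len(rev_rc):
        return None                # assumption: truncated mates cover the same span
    merged, mismatches = [], 0
    for f_base, f_q, r_base, r_q in zip(fwd_insert, fwd_quals, rev_rc, rev_q):
        if f_base == r_base:
            merged.append(f_base)
            continue
        mismatches += 1
        if mismatches > max_mismatches:
            return None            # more than the allowed single mismatch
        if f_q > r_q:
            merged.append(f_base)
        elif r_q > f_q:
            merged.append(r_base)
        else:
            merged.append(rng.choice((f_base, r_base)))  # tie: random pick
    return "".join(merged)
```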

To reduce the risk of contamination of the sequencing data by the Illumina sequencing control DNA (phiX) due to index misassignment, all of the raw data was first aligned to the phiX genome using bwa mem with default parameters. Reads that did not map (samtools fastq -f 12) were extracted and used for downstream analyses.

Because it was found that overhanging adapters were less reliable when encountered on forward (P5) rather than reverse (P7) reads, the analyses ignored forward reads that began with an overhanging adapter. Blunt adapters were allowed on both the forward and reverse reads. In all cases, this filtering step was applied only when computing results (all reads were included when processing, merging, and aligning, but overhanging adapters on forward reads were not allowed to affect results).

Custom code was used to identify overhangs using the Unique End Identifier (UEI) barcodes. The algorithm included the following features:

1. a data structure that contains a list of UEI barcode sequences that indicate the type and length of an overhang, or a blunt end;
2. take the first 7 bases of each read (7 = length of barcode);
3. see if these match a known barcode;
4. if there is one N, see if converting it to a base makes it match a known barcode;
5. if it matches a barcode, look up that barcode's overhang by taking from the read the number of bases indicated by the barcode; and
6. ignore forward reads, unless the barcode is blunt.
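The listed features can be sketched in a few lines. This is an illustrative reading of the steps, not the code used in the Example. The UEI strings are assumed to be the lower-case portions of Sequence 1 in Table 4, on the assumption that this is the orientation seen at the start of a read; the function names, and the choice to reject an "N"-containing barcode that could match more than one UEI, are likewise assumptions.

```python
from typing import Optional, Tuple

UEI_LENGTH = 7

# Feature 1: UEI -> (overhang type, overhang length).
# Assumed from the lower-case segments of Sequence 1 in Table 4.
UEI_TABLE = {
    "ATACCGC": ("blunt", 0),
    "GATATCG": ("3'", 1), "GTCTGAC": ("3'", 2), "GAGCCAA": ("3'", 3),
    "CGCCATA": ("3'", 4), "CGTATAT": ("3'", 5), "GACTAAG": ("3'", 6),
    "AGTACGG": ("5'", 1), "AGCAGCG": ("5'", 2), "CCATATG": ("5'", 3),
    "AGCCTGG": ("5'", 4), "ATACGCG": ("5'", 5), "GCTAGGC": ("5'", 6),
}


def resolve_uei(prefix: str) -> Optional[str]:
    """Features 3 and 4: exact match, or a match after replacing a single N."""
    if prefix in UEI_TABLE:
        return prefix
    if prefix.count("N") == 1:
        candidates = [prefix.replace("N", base) for base in "ACGT"]
        matches = [c for c in candidates if c in UEI_TABLE]
        if len(matches) == 1:  # assumption: ambiguous rescues are rejected
            return matches[0]
    return None


def classify_read(seq: str, is_forward: bool) -> Optional[Tuple[str, int, str]]:
    """Return (overhang type, overhang length, overhang sequence), or None."""
    if len(seq) < UEI_LENGTH:
        return None
    uei = resolve_uei(seq[:UEI_LENGTH].upper())  # feature 2: first 7 bases
    if uei is None:
        return None
    kind, length = UEI_TABLE[uei]
    if is_forward and kind != "blunt":           # feature 6
        return None
    overhang = seq[UEI_LENGTH:UEI_LENGTH + length].upper()  # feature 5
    return kind, length, overhang
```

For example, a reverse read beginning `AGCCTGGAATT` would be reported as a 5′ 4-nt overhang with sequence `AATT`, while the same sequence on a forward read would be ignored.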

Control Oligo Experiments

Control oligos were short (50 bp) sequences of synthetic double-stranded DNA, one end of which was synthesized to have a single-stranded overhang and the other end of which was intended to be blunt. When processing, all properly formed sequences were expected to merge using the criteria above, except in cases where control oligos chained together. Two ways of assessing control oligo experiments were defined, one to measure sensitivity and the other to measure specificity.

To measure sensitivity, the percent of legitimate (non-chained) control oligo ends that were correctly identified using the adapters was computed. First, all reads that merged using the criteria above were considered. When a mismatching base was encountered while merging a read pair, the base with the higher quality score was chosen; if quality scores were tied, the base was chosen randomly. Next, a reference sequence comprising all control oligo sequences and their reverse complements was constructed, separated by runs of “N” bases equal in length to the longest control oligo overhang. To determine the control oligo type of each merged read, merged reads were aligned to this reference sequence using the Edlib C++ sequence alignment library, allowing gaps at the beginning and end of the read in the alignment and allowing up to one base mismatch, letting “N” match any base with no penalty. If the best alignment fell within the coordinates of a single control oligo sequence (a non-chimeric alignment), that control oligo was chosen as the correct sequence. A control oligo was considered correct if the barcode for the correct overhang was ligated to the overhang end of the oligo and the barcode for blunt adapters was ligated to the opposite end.
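The reference construction and the non-chimeric test can be sketched as below. The alignment step itself (performed with Edlib in the Example) is not reproduced; the sketch assumes an aligner has already returned a half-open start/end interval on the reference. Function names and the `_rc` label suffix are hypothetical.

```python
from typing import Dict, Optional, Tuple

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_reference(oligos: Dict[str, str],
                    spacer_len: int) -> Tuple[str, Dict[str, Tuple[int, int]]]:
    """Concatenate each oligo and its reverse complement, separated by N runs.

    Returns the reference string and half-open (start, end) coordinates
    for every entry, so an alignment can be traced back to one oligo.
    """
    parts, coords, pos = [], {}, 0
    for name, seq in oligos.items():
        for label, entry in ((name, seq), (name + "_rc", revcomp(seq))):
            coords[label] = (pos, pos + len(entry))
            parts.append(entry)
            parts.append("N" * spacer_len)  # spacer as long as the longest overhang
            pos += len(entry) + spacer_len
    return "".join(parts), coords


def assign_oligo(aln_start: int, aln_end: int,
                 coords: Dict[str, Tuple[int, int]]) -> Optional[str]:
    """Return the entry fully containing [aln_start, aln_end), else None (chimeric)."""
    for label, (start, end) in coords.items():
        if start <= aln_start and aln_end <= end:
            return label
    return None
```

With the control oligos of Table 2, `spacer_len` would be 6, the length of the longest overhang.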

To measure specificity, the percent of UEI sequences that were ligated to the correct end of the control oligo with the matching overhang was computed. In this case, it was not assessed whether control oligos formed chains, thereby assessing any DNA end available for ligation. For every paired-end read (truncated as described above, but not merged), the sequence following the UEI was aligned to a reference sequence containing all control oligo sequences, separated by runs of “N” bases equal to the length of the longest overhang. The best alignment, allowing up to one mismatch and with “N” matching any base, was used to determine the correct control oligo sequence, if the alignment was non-chimeric (within the coordinates of a single control oligo sequence). Specificity was then defined as the percent of reads for which the UEI at the beginning of the read was followed by the correct type of control oligo end, in the correct orientation.

For determining nucleotide composition of overhang sequence, all bases between the end of a UEI sequence and the beginning of a control oligo sequence were considered to be the true sequence of the overhang. When assessing the base composition of overhang sequences, all adapters were required to be ligated to the correct type of control oligo.

Human DNA

Paired-end reads that remained after filtering were truncated if necessary and aligned to the hg19 human reference genome downloaded from the UCSC genome browser. For alignment, bwa aln and bwa sampe with default parameters were used, skipping the UEI sequences at the beginning of the reads (-B parameter). Duplicate reads were then removed using samtools rmdup. Reads were counted as mapped only when in proper pairs with a minimum map quality of 20 (samtools view -c -f66 -q20), except in the case of the restriction enzyme experiments, in which the requirement for proper pairing was removed (samtools view -c -f64 -q20) due to the possibility of chaining fragments causing chimeric alignments.
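The numeric values passed to the samtools -f option above are bitwise combinations of SAM flags. The short sketch below decodes them using the bit definitions from the SAM format; the helper names are hypothetical.

```python
# SAM FLAG bits as defined in the SAM format specification.
SAM_FLAGS = {
    1: "paired",
    2: "proper_pair",
    4: "unmapped",
    8: "mate_unmapped",
    16: "reverse",
    32: "mate_reverse",
    64: "read1",
    128: "read2",
    256: "secondary",
    512: "qc_fail",
    1024: "duplicate",
    2048: "supplementary",
}


def decode_flag(flag: int) -> list:
    """List the names of all bits set in a SAM flag value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def has_required_bits(read_flag: int, required: int) -> bool:
    """Mimic the -f filter: keep a read only if every required bit is set."""
    return read_flag & required == required


print(decode_flag(66))  # proper pair, first read of the pair
print(decode_flag(64))  # first read of the pair, pairing not required
print(decode_flag(12))  # read unmapped and mate unmapped
```

Counting only first-in-pair reads (bit 64) means each read pair contributes once to the mapped count, with or without the proper-pair requirement (bit 2).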

To count UEI types in mapped reads, the BAM files were scanned using HTSLib's BAM parser, and UEI sequences were obtained from the BC tag. Overhang sequences were obtained by taking a number of bases from the beginning of each read equal to the overhang length indicated by the UEI.

Some sequencing libraries contained human DNA spiked with control oligos. To analyze these libraries, all sequencing reads were first processed as if the libraries contained only human DNA. Then, non-human sequences were extracted from the alignments to the human reference genome, by selecting unmapped reads and reads with map quality less than 10 (using a custom technique that can re-append barcodes to extracted read sequences, unlike samtools fastq). These reads, which were mostly control oligos, were then processed the same way as other control oligo libraries.

Results

Library Construction

The method in this Example assayed fragmented and degraded dsDNA termini following a library preparation workflow. Each adapter for this method included three parts: P5/P7 Illumina-based sequencing and index priming sites, a 7 base pair (bp) Unique End Identifier (UEI; a barcode encoding the termini type), and a blunt end or a single-stranded overhang that hybridizes and ligates to the substrate's overhang, when present. The overhangs were synthesized with equal proportions of random sequence of length N (here up to 6 nucleotides (nt) long). Adapters were included in excess to ensure that every template dsDNA type has access to a compatible adapter. In this way, adapters were introduced in a competitive reaction that provided enough compatible sequences to hybridize with all possible sticky-ended template molecules. However, the overhangs of the adapters create the potential for self-hybridization and ligation; the adapters therefore were left unphosphorylated to prevent adapter dimer formation.

During the initial step of this method, template DNA was treated with polynucleotide kinase so as to phosphorylate 5′ termini. Aside from the phosphorylation, template DNA termini are not altered. Next, a two-step ligation was performed. First, the 5′ phosphorylated template DNA was ligated to a pool of unphosphorylated, UEI-containing adapters. This first ligation occurs only at the forward (P5) adapter ends of both template strands. Next, purification was performed to remove excess, unligated adapters. Finally, the 5′ ends of the adapters were phosphorylated and a second ligation was performed, this time at the reverse (P7) adapter ends, in order to complete the dsDNA library molecule. Fully formed molecules were then indexed and amplified using a universal P5 primer and a uniquely indexed P7 primer. Following paired-end sequencing on an Illumina sequencer, the UEI was used to classify sequence reads by the type, length, and sequence of the overhang.

Assessing Accuracy of DNA Termini Identification

Accuracy

To determine the accuracy of the assay described in this Example, a pool of 12 synthetic double-stranded control oligos was constructed, each with a known length and type (3′ or 5′) of single-stranded overhang. Each control oligo contained a unique and identifiable 50 bp core and a common structure: blunt terminus on one side, and a 5′ or 3′ overhang of a specific length (1 to 6 nt) on the other side. After sequencing libraries, the UEIs on the reverse read (P7) were used to quantify the assay's accuracy by comparing how often the overhang indicated by the adapter UEI correctly matched the overhang engineered on the dsDNA control template. Analysis was limited to reverse reads because the UEI present on the reverse adapter was more accurate in predicting the correct overhang than when the UEI present on both adapters was included, or when only the forward adapter was included (FIG. 18). A model to explain this phenomenon is provided in FIG. 19.

The specificity of the assay in this Example was measured in two ways using the control oligo pools. First, the dataset was limited to correctly formed library molecules (i.e., monomeric control oligos that have a sequence within an edit distance of 1 of a true control oligo sequence), and the frequency with which the correct overhang type and length were captured was calculated. For each control oligo, the most commonly observed adapter UEI was the correct adapter in all overhang types (3′ and 5′) and lengths (1 to 6-nt) tested. However, a minor fraction of library molecules was observed whose UEI did not correspond to the known overhang length or type for these control oligos. The overall UEI accuracy over each overhang type and length was 84.94% ± 0.72% (95% C.I.). Next, the specificity of each UEI was measured by counting the times each UEI adapter was observed ligated to each type of synthetic oligo. For all 3′ UEIs and the blunt UEI, the most common ligation event was the correct one. For all 5′ UEIs other than the 5′ 1-nt overhang, the most common ligation event was the correct one. However, taken together as a group, the 5′ overhangs have lower accuracy than that of the 3′ overhangs. Errors most often occurred ±1-nt in distance from the correct length, notably in the 5′ 1-nt, 5′ 3-nt, and 5′ 5-nt controls.

Base Composition

To determine if ligation accuracy or efficiency is influenced by the base composition of the overhangs, sequence data and UEI data were used to determine the nucleotide sequence of each recovered single-stranded overhang. Due to the architecture of the libraries, the bases present in the 5′ overhangs derived from the insert template molecule, which was in this case the control oligo, whereas 3′ overhangs were derived from the DNA overhang of the adapter itself. A uniform distribution of nucleotides was observed for each overhang type and length except for 5′ 1-nt overhangs, where an excess of cytosine was observed. To evaluate whether this cytosine bias was a product of the oligo synthesis process, standard end repaired libraries were prepared (NEB Ultra II) with the synthetic control oligos. The end repair step removed 3′ overhangs and filled in 5′ overhangs via polymerase activity, allowing characterization of the base composition of the synthetic DNA 5′ overhangs. Within the standard end repaired libraries of the control oligos, an elevated read count of 5′ 1-nt cytosines was observed. Therefore, this observation is likely to derive from biases in the custom oligo synthesis and not from biases introduced during ligation.

Sensitivity

To evaluate the ability of the adapters herein to detect the presence of a specific terminus type in a background of extraneous DNA molecules, a dilution series was performed in which DNA with a single known overhang sequence was mixed into a pool of diverse overhangs. The pool of diverse termini was created by sonication (Diagenode Bioruptor) and size selection of NA12878 genomic DNA (gDNA). The DNA with a single known overhang was created by digesting NA12878 gDNA with the restriction endonuclease MluCI, which creates 5′ 4-nt overhangs of the sequence AATT, followed by size selection.

First, libraries generated from the sonicated template and from the MluCI-digested template DNA were sequenced to characterize termini in both samples. The overhang length distribution for the sonicated sample (FIG. 15, panel A) showed that sonication shearing of DNA creates a nonrandom profile characterized by a prevalence of blunt termini followed by 1 to 4-nt overhangs that occur on both the 5′ and 3′ termini with an excess for 3′ 1-nt and 3′ 2-nt overhangs. As expected for MluCI, the length distribution for the MluCI-digested DNA showed an overwhelming excess of 5′ 4-nt overhangs (FIG. 15, panel B).

To perform the dilution series, defined amounts of the MluCI-digested DNA were mixed with the sonicated DNA sample, and then libraries were generated from the pooled mixtures. The pools contained from 1% up to 50% MluCI-digested DNA. The percent of sequence reads in each library that were attributed to 5′ AATT overhangs (FIG. 15, panel C; Table 5) was calculated. Overall, there was concordance between the known MluCI fraction in the library pool and the fraction of 5′ 4-nt overhangs with the correct AATT sequence observed in the sequenced data. Libraries with higher fractions of MluCI-digested DNA (10% to 100%) showed fewer than expected 5′ AATT overhangs, likely due to the overabundance of compatible sticky template ends, whereas the libraries with lower fractions generated more accurate estimates of the known MluCI fraction.

TABLE 5 Sequencing data for sensitivity experiments
Sample Type | Number of Read Pairs Sequenced | % Reads Mapped to hg19 | Number of Overhangs Observed in R2 | Number of 5′ AATT Overhangs Observed in R2 | % of 5′ AATT in R2 Overhangs
100% Mechanical Shear | 138921 | 91 | 91011 | 41 | 0.05
1% RE Digest | 132516 | 90 | 86052 | 647 | 0.75
5% RE Digest | 134791 | 90 | 88194 | 2977 | 3.38
10% RE Digest | 129736 | 90 | 85432 | 4268 | 5.00
25% RE Digest | 106698 | 88 | 68811 | 6499 | 9.44
50% RE Digest | 125340 | 86 | 75954 | 12211 | 16.08
100% RE Digest | 97724 | 76 | 44139 | 17523 | 39.70
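The last column of Table 5 follows from the two preceding count columns. The sketch below recomputes it from the tabulated values as a consistency check; the variable and function names are hypothetical.

```python
# (sample, overhangs observed in R2, 5' AATT overhangs observed in R2), from Table 5
TABLE_5 = [
    ("100% Mechanical Shear", 91011, 41),
    ("1% RE Digest", 86052, 647),
    ("5% RE Digest", 88194, 2977),
    ("10% RE Digest", 85432, 4268),
    ("25% RE Digest", 68811, 6499),
    ("50% RE Digest", 75954, 12211),
    ("100% RE Digest", 44139, 17523),
]


def percent_aatt(n_aatt: int, n_overhangs: int) -> float:
    """Percent of R2 overhangs that are 5' AATT."""
    return 100.0 * n_aatt / n_overhangs


for sample, n_overhangs, n_aatt in TABLE_5:
    print(f"{sample}: {percent_aatt(n_aatt, n_overhangs):.2f}%")
```

Comparing each computed percentage with the nominal digest fraction makes the under-recovery at higher MluCI fractions directly visible.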

Next, to estimate at what concentration of MluCI-digested DNA the 5′ AATT signal is lost, the sonicated libraries containing titrated amounts of MluCI-digested DNA were compared to the control sonicated library that contained no spiked 5′ AATT overhangs. Even at the lowest concentration in the series (1% MluCI), the occurrence of 5′ AATT overhangs was detected over all other overhangs, p<0.001 (FIG. 15, panel D). This observation indicates the assay in this Example, when compared to an appropriate control library, is sufficiently sensitive to discover overhang motifs that make up less than 1% of a library. Mapping the precise genomic location of DNA ends sheared by sonication and digested by MluCI showed a random distribution of overhangs and an overabundance of expected 5′ 4-nt overhangs, respectively.

Overhang Profile of Common DNA Shearing Mechanisms

Fragmentation, or shearing, of high molecular weight DNA typically is a necessary step in creating short-read, e.g. Illumina, sequencing libraries. Popular means of shearing DNA include sonication and enzymatic digestion. To explore whether mechanical shearing via sonication biases or otherwise affects the quality of NGS libraries, several studies have compared the consequences of shearing methods, including one that used a cloning based approach to examine termini created by DNA sonication, and another that analyzed 5′ overhangs in standard end repaired NGS libraries to find evidence for non-random shearing by DNA sonication. To explore the consequences of enzymatic shearing, several studies have used molecular, microarray, and sequencing-based methods to describe the digestion preferences of DNaseI as well as the cut preferences of Micrococcal Nuclease but in low resolution. Because libraries generated using adapters described herein can assess the microstructure of all free DNA ends in an unbiased high-throughput way, the assay described in this Example was used to characterize the DNA end profiles of naked NA12878 genomic DNA fragmented via both sonication and enzymatic shearing (Table 6).

TABLE 6 Sequencing data for common shearing mechanisms and nucleases
Sample Type | Number of Read Pairs Sequenced | % Molecules that Contain UEIs | % UEI Molecules Mapped to hg19
Bioruptor Shear | 147,613 + 30,695 | 94.7 | 89
NEB dsFragmentase | 184,975 + 66,401 | 94 | 86
DNaseI | 184,938 + 27,744 | 92 | 66
Micrococcal Nuclease | 178,653 + 30,933 | 95 | 82

Sonication

To interrogate the overhangs produced by sonication in greater detail, data generated from libraries sheared with a Diagenode Bioruptor was examined, as introduced above (see Sensitivity section). The overhang length and sequence motif distributions for the sonicated sample showed that sonication of DNA creates a higher prevalence of blunt ends followed by 1 to 4-nt overhangs that occur on both the 5′ and 3′ ends, but with a preference for 3′ 1-nt and 3′ 2-nt overhangs. The base composition profiles of the Bioruptor-sonicated DNA showed a balanced spectrum except for when a 1-nt overhang forms. A double-strand break (DSB) with a 1-nt overhang most often leaves a single cytosine overhang, a phenomenon observed previously on 5′ 1-nt overhangs.

Enzymatic Shearing

To interrogate the overhangs produced by enzymatic shearing, libraries were generated using three endonucleases, dsFragmentase (New England Biolabs, NEB), DNaseI and Micrococcal nuclease (MNase). The NEB product dsFragmentase is a cocktail of two enzymes designed specifically for NGS DNA fragmentation. The overhang length distribution from the dsFragmentase-digested sample created more molecules with overhangs and fewer blunt ended molecules when compared to sonication. The dsFragmentase also created a more random shearing profile than the Bioruptor, based on the variation in overhang lengths observed between library replicates. The adapters used in this study extend only to 6-nt. The sheer number of molecules containing 6-nt overhangs indicates dsFragmentase creates overhangs that are longer than 6-nt. The motif distribution for the dsFragmentase overhangs showed an even distribution between the generation of 5′ and 3′ overhangs. The base composition of the dsFragmentase overhangs was similar to that of the Bioruptor, but with more cytosines in overhangs of 2-nt to 6-nt in length.

Next, the overhangs produced by endonucleases DNaseI and Micrococcal Nuclease (MNase) were interrogated. The overhang length plot for DNaseI-digested naked DNA showed a prevalence of 3′ 2-nt overhangs, as well as 5′ 2-nt to 4-nt overhangs. Of the latter, overhangs of 3 or more nucleotides were GC-rich. Based on the relative abundance of 6-nt overhangs in the overhang length distribution plot, it is likely that DNaseI creates overhangs greater than 6-nt in length. The overhang base composition profile of DNaseI-digested DNA showed a decreasing preference for cytosine in the 5′ overhang as the length of the overhang increased, and a slight preference for 3′ 1-nt thymine. The base composition of nucleotides upstream of DNaseI cut sites showed a preference for the cutting of DNA at A/T sites in the −1 position of 5′ overhangs.

The overhang length plot for MNase-digested DNA (FIG. 16, top panel) showed that MNase has a strong preference for the creation of blunt DNA termini (39.5% of the overhang data), with longer overhangs becoming decreasingly likely. When an overhang is produced, MNase showed an overall preference for A/T-rich 5′ overhangs (excluding 1-nt overhangs) (FIG. 16, bottom panel). The base composition of nucleotides upstream of MNase cut sites shows that, although the actual 3′ overhangs produced by MNase are not as A/T-rich as 5′ overhangs, the −1 position of 3′ overhangs is overwhelmingly A/T-rich.

These results show that the method described in this Example is able to reproduce the outcomes of previous studies that characterized the preferences and biases of various DNA shearing methods. These results also highlight the potential benefits of utilizing this method in the characterization of novel nucleases.
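As a rough illustration of how an overhang length distribution of the kind summarized above might be tabulated, the sketch below counts hypothetical (overhang type, overhang length) calls and reports each category as a percentage of all calls. The record format, the function name, and the toy data are assumptions made for illustration; the labels "5", "3" and "BL" simply mirror those used in the example classifier code later in this document, and nothing here reproduces the analysis pipeline actually used in this Example.

```python
from collections import Counter

def overhang_length_distribution(calls):
    """Return {(overhang_type, length): percent of all calls}."""
    counts = Counter(calls)
    total = sum(counts.values())
    return {key: 100.0 * n / total for key, n in counts.items()}

# Hypothetical calls: "BL" = blunt (length 0), "5"/"3" = overhang side.
calls = [("BL", 0)] * 4 + [("5", 1)] * 2 + [("3", 1)] * 3 + [("3", 2)] * 1
dist = overhang_length_distribution(calls)
for key in sorted(dist):
    print(key, round(dist[key], 1))
```

A per-category percentage of this kind is what allows a statement such as "blunt termini account for 39.5% of the overhang data" to be compared across shearing methods.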

Recovery of Ends Generated by In Vivo Nuclease Activity in Whole Blood

Recently, circulating cell-free DNA (cfDNA) profiling has garnered considerable attention for use in non-invasive prenatal testing and cancer diagnostics. Obtaining high quality DNA from blood plasma begins with blood collection itself. Blood coagulation or clotting is a process associated with increased nuclease activity, and the type of blood collection tube (BCT) may affect the quantity and quality of a cfDNA extract. Here, the ability of common BCTs to maintain cfDNA integrity was assayed by constructing libraries of cfDNA extracted from various BCTs spiked with known control oligos (described above). BCTs containing commonly used anticoagulants were included (Table 3), and a control tube without anticoagulants (red top tube; RTT) was included.

Before extracting plasma (or serum) and isolating cfDNA as described above, control oligos were spiked into each of four tube types. Mixtures of control oligos and cfDNA were extracted at 0, 4, and 24 hours following the oligo spike-in and converted into libraries using adapters described herein. In Streck® BCTs (SBCTs), which contain additives that inhibit nuclease activity and cell lysis, the human cfDNA fragment length profile or abundance did not change over time. Conversely, multi-nucleosome fragments appeared in YTTs (anticoagulant: citrate) and PTTs (anticoagulant: potassium EDTA) at 24 hours, reminiscent of apoptotic cellular gDNA. In the RTTs, multi-nucleosome fragments were seen as early as 0 hours, suggesting that apoptotic processes may be initiated during blood coagulation that are associated with release of endo- and exo-nucleases.

The incubation of whole blood containing control oligos prior to cfDNA extraction allowed the quantification of the amount of loss or change of known DNA ends due to nuclease(s) that remained active following the blood draw. No significant loss of, or change to, the control DNA end profiles was observed in the SBCTs, the PTTs, or the negative controls (FIG. 17). In YTTs, which do not contain any known nuclease inhibitors, changes in both 3′ and 5′ overhang profiles were observed. These changes indicated the presence of one or more active circulating exonucleases (FIG. 17). By 24 hours the 3′ overhang signal of control oligos was significantly diminished in YTTs, suggesting the 3′ to 5′ exonuclease(s) may be more processive than the 5′ to 3′ exonuclease(s). In the RTTs, the complete loss of 3′ overhang counts was observed within 4 hours, as well as depletion of true blunt ends, identified by the generation of new overhangs on formerly blunt molecules. By 24 hours, the control oligos were no longer visible in RTTs, as was expected in a highly active nuclease environment. In sum, these observations show that the method described in this Example discerns changes in overhang patterns of cfDNA and can be used to investigate the effect(s) of circulating nucleases in the blood.

Example 12: Analysis of Overhang Sequencing Datasets

Nucleic acid overhangs of a population of nucleic acids in a sample may be interrogated (e.g., using bioinformatic analyses) to generate an overhang profile comprising one or more features of the overhangs (e.g., quantification of certain overhang types (e.g., 5′, 3′, blunt), quantification of certain overhang lengths, quantification of certain overhang sequence features, and the like). In certain instances, features of the template molecules may be considered. Based on the overhang profile and/or certain template features, one or more characteristics of the sample may be determined.

Provided in Table 7 are example feature variables (e.g., overhang features; template features) that may be used to determine one or more characteristics of a sample.

TABLE 7 Example feature variables

Feature | Examples
presence/absence of dinucleotide/trinucleotide/tetranucleotide* | overhang; template + overhang; template minus overhang
dinucleotide/trinucleotide/tetranucleotide* count | overhang; template + overhang; template minus overhang
dinucleotide/trinucleotide/tetranucleotide* percent | overhang; template + overhang; template minus overhang
full length of template |
length category (e.g., for cfDNA) | subnucleosome; mononucleosome; multinucleosome
length of overhang |
overhang type | 5′ overhang; 3′ overhang; blunt (no overhang)
GC content | overhang; template + overhang; template minus overhang
overhang percent | log2 percent of overhang sequence/total overhangs
overhang count | counts of particular overhang sequence
percent overhang length | length of overhang/full length of template
dinucleotide/trinucleotide/tetranucleotide* count in overhang vs. entire sequence of template |
Boolean variables: whether overhang overlaps with/is contained in/starts or ends in particular regions | coding regions; CpG islands; transcription factor binding sites (e.g., CCCTC-binding factor (CTCF) binding site); DNAse hypersensitive sites; ATAC-seq peaks (e.g., open chromatin); promoter regions; enhancer regions; hypermethylated regions
genome coordinate |
mean fragment length or distribution of molecules with a given overhang type and length |
mean fragment length or distribution of molecules with a given overhang sequence |
delta between libraries | mean length or distribution of fragments with a given overhang sequence vs. its X**; difference between library A and library B, where A has mean fragment length of 200 (or any X variable) and 0 count of GC dinucleotide (or any Y variable) compared to B with mean fragment length of 100 and 10 GC dinucleotide

*Example dinucleotides include AA, AT, AC, AG, TT, TA, TC, TG, CC, CG, CA, CT, GG, GA, GC, GT; 4^3 possible trinucleotide combinations; 4^4 possible tetranucleotide combinations
**X = any feature above
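To make a few of the Table 7 feature variables concrete, the following minimal sketch derives overhang length, GC content, CG dinucleotide count, percent overhang length, and a length category for a single fragment end. The function name, the dictionary keys, and the 120 and 250 nucleotide category cut-offs are assumptions chosen for illustration only; the source does not define the boundaries between subnucleosome, mononucleosome and multinucleosome fragments.

```python
def overhang_features(overhang_seq, overhang_type, full_len):
    """Compute a few Table 7-style features for one fragment end."""
    seq = overhang_seq.upper()
    length = len(seq)
    gc_bases = seq.count("G") + seq.count("C")
    # Blunt ends have no overhang bases, so GC content is reported as 0.
    gc_percent = 100.0 * gc_bases / length if length else 0.0
    # Category cut-offs below are illustrative assumptions only.
    if full_len < 120:
        category = "subnucleosome"
    elif full_len < 250:
        category = "mononucleosome"
    else:
        category = "multinucleosome"
    return {
        "overhang_type": overhang_type,
        "overhang_length": length,
        "gc_percent": gc_percent,
        "CG_count": seq.count("CG"),  # non-overlapping occurrences
        "percent_overhang_length": length / full_len,
        "length_category": category,
    }

features = overhang_features("ACGCG", "5", 167)
blunt = overhang_features("", "BL", 90)
print(features)
```

Computing the same quantities on "template + overhang" or "template minus overhang" would only require passing a different sequence string to the same function.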

An example of certain bioinformatic analyses of overhang data and the relation to a sample characteristic is described below.

Heat Map of Overhang Data

Libraries were generated using Y-shaped overhang adapters described herein (see e.g., the Y-shaped adapter and method shown in FIG. 9B) for cell-free DNA from donors having cancer (“cancer donors”) and healthy donors. A heat map (FIG. 20) was generated from sequencing data of DNA overhangs present in each library using Ward's hierarchical clustering method. Each column of the heat map shown in FIG. 20 represents a single cell-free DNA library from a cancer donor (black bar) or healthy donor (no bar). Each row of the heat map shown in FIG. 20 represents a unique overhang (5′ or 3′) of length 1 to 6 nucleotides; rows (overhangs) containing at least one CG dinucleotide, or CpG, are indicated by a grey bar. Within the heat map matrix shown in FIG. 20, the darker the color, the greater the proportion (log scaled) of the library that the overhang represents. Lighter colors indicate depletion of that overhang.

As shown in FIG. 20, the majority of CpG-containing overhangs are clustered towards one end of the tree, with a few smaller clusters throughout. One primary cluster of cancer donors was observed (FIG. 20, second clade from top left), the majority (12 of 13) of which had GI cancers. These samples showed a lower percent of the libraries having CG-containing overhangs (depletion), whereas healthy donors tended to have a higher percentage of libraries having CG-containing overhangs (enrichment). Thus, with respect to CG-containing overhangs, a pattern of depletion was observed in certain clusters of cancer cell-free DNA libraries.
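The matrix underlying a heat map of this kind can be sketched as follows: each row is a unique overhang, each column is a library, and each value is the log-scaled proportion of that library's overhangs. This is a minimal sketch under stated assumptions: the "side:sequence" overhang labels, the library names, the base-2 logarithm, and the small pseudocount used to avoid taking the log of zero are all illustrative choices that the source does not specify, and the clustering step itself is not shown.

```python
import math
from collections import Counter

def log_proportion_matrix(libraries, pseudocount=1e-6):
    """Rows = overhang labels, columns = libraries, values = log2 proportion."""
    names = sorted(libraries)
    counts = {name: Counter(libraries[name]) for name in names}
    rows = sorted({oh for c in counts.values() for oh in c})
    matrix = []
    for oh in rows:
        row = []
        for name in names:
            total = sum(counts[name].values())
            proportion = counts[name][oh] / total
            # The pseudocount (an assumption) keeps absent overhangs finite.
            row.append(math.log2(proportion + pseudocount))
        matrix.append(row)
    return rows, names, matrix

# Hypothetical overhang calls for two toy libraries.
libraries = {
    "lib_A": ["5:CG", "5:CG", "3:AT", "3:T"],
    "lib_B": ["3:AT", "3:AT", "3:T", "3:T"],
}
rows, names, matrix = log_proportion_matrix(libraries)
for label, values in zip(rows, matrix):
    print(label, [round(v, 2) for v in values])
```

A matrix in this shape could then be passed to a hierarchical clustering routine that supports Ward linkage, so that libraries with similar overhang profiles are placed in neighboring columns.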

Machine Learning Approaches Using Overhang Sequence Data

A logistic regression model and a support vector machine (SVM), both supervised learning algorithms, were used to classify cancer and healthy samples with variables including CG count, overhang length, full molecule length, and other variables as set forth in FIG. 21. Both SVM and logistic regression had an accuracy of 75%, with precision and recall above 75% (precision: ability of the model to label a sample as truly positive; recall: ability of the model to find all positive samples).

Variables used in the models are provided in FIG. 21. After Recursive Feature Elimination, all variables were considered best-performing features after repeating the process of creating the model with different subsets, recursing on these with smaller and smaller sets of features.

Logistic Regression Classifier (Cancer Vs. Healthy, Confusion Matrix)

The accuracy of the logistic regression classifier on the test set was 75%, with a better split between true positives, false positives, true negatives and false negatives: 790+823 correct predictions and 308+230 incorrect predictions (FIG. 22).

SVM Classifier (Cancer Vs. Healthy—Classification Report and ROC)

Precision, the ability of the model to label a sample as truly positive, was 73%, and recall, the ability of the model to find all positive samples, was 78%. For the final model, the data were split into a 70% training set and a 30% test set. The f1-score is the harmonic mean of precision and recall, and support is the number of samples taken into account from the test set. The accuracy of the classification was 75%.
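The relationship between a confusion matrix and the reported metrics can be sketched with the counts given above for the logistic regression test set (790+823 correct, 308+230 incorrect). The source does not say which count is which cell, so the mapping used below (823 true positives, 790 true negatives, 308 false positives, 230 false negatives) is an assumption, as is the function name. Under this assumed mapping the rounded values agree with the 73% precision, 78% recall and 75% accuracy figures quoted in this section.

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, f1 (harmonic mean) and accuracy from four counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Assumed cell assignment; only the correct/incorrect totals are stated.
metrics = classification_metrics(tp=823, fp=308, tn=790, fn=230)
for name, value in metrics.items():
    print(name, round(value, 2))
```

Accuracy depends only on the totals of correct and incorrect predictions, so it is the one value here that does not rely on the assumed cell assignment.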

GI Cancer Vs. Healthy—Model Summary

The odds of a patient having cancer increase by 120% if the overhang contains the CG dinucleotide sequence. The odds of a patient having cancer increase by 105% if the overhang contains the GG dinucleotide sequence, and by 50% if the overhang contains the GC dinucleotide sequence. All feature variables had significant p-values (P>|z|) in the final model, given a cut-off of 0.05.
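Statements of this form are commonly read off a logistic regression summary by exponentiating a coefficient: a coefficient β corresponds to an odds ratio of exp(β), that is, a percent change in odds of (exp(β) − 1) × 100. The sketch below shows the conversion in both directions. It assumes the percentages above were obtained this way, which the source does not state, and the function names are hypothetical.

```python
import math

def percent_change_in_odds(coefficient):
    """Percent change in odds implied by a logistic regression coefficient."""
    return (math.exp(coefficient) - 1.0) * 100.0

def coefficient_for_percent_change(percent):
    """Coefficient (log odds ratio) implied by a percent change in odds."""
    return math.log(1.0 + percent / 100.0)

# A 120% increase in odds corresponds to an odds ratio of 2.2.
beta_cg = coefficient_for_percent_change(120.0)
print(round(beta_cg, 3), round(percent_change_in_odds(beta_cg), 1))
```

On this reading, the 120%, 105% and 50% increases correspond to odds ratios of 2.2, 2.05 and 1.5, respectively.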

GI Cancer Vs. Other (Includes Healthy and Other Cancer)—Model Summary

The odds of a patient having cancer increase by 94% if the overhang contains the CG dinucleotide sequence. All feature variables had significant p-values (P>|z|) in the final model, given a cut-off of 0.05.

Example Classifier

Cancer vs healthy

In [1]:
    import pandas as pd
    import numpy as np
    from sklearn import preprocessing
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    import dask.dataframe as dd

In [2]:
    # Load the per-molecule overhang table; the file has no header row.
    dt = pd.read_csv('APN_cpg_out.csv', sep=',', header=None,
                     dtype={0: str, 1: str, 2: int, 3: str, 4: str, 5: int, 6: int,
                            7: object, 8: int, 9: str, 10: str, 11: float})
    # Twelve names for twelve columns ('start' and 'end' are separate columns).
    dt.columns = ['barcode', 'overhang_type', 'overhang_length',
                  'overhang_seq', 'aligned_seq', 'start', 'end', 'aligned_len',
                  'full_len', 'chr', 'lib_name', 'cpg_island']

In [3]:
    # Assign the class label y to each molecule by library name.
    libs_y1 = ['APN1047', 'APN1048', 'APN1049', 'APN1050', 'APN1051', 'APN1052',
               'APN1053', 'APN1054', 'APN1055', 'APN1056', 'APN1057', 'APN1058',
               'APN816', 'APN1021', 'APN1022', 'APN1026', 'APN1027', 'APN1028',
               'APN1029']
    libs_y0 = ['APN815', 'APN823', 'APN824', 'APN825', 'APN826', 'APN827',
               'APN828', 'APN829', 'APN830', 'APN831', 'APN832', 'APN833',
               'APN834', 'APN835', 'APN886', 'APN887', 'APN888', 'APN890',
               'APN908', 'APN909', 'APN911', 'APN709', 'APN710', 'APN711',
               'APN713', 'APN716', 'APN717', 'APN718', 'APN719', 'APN720',
               'APN721', 'APN722', 'APN723', 'APN724', 'APN725', 'APN726',
               'APN727', 'APN728', 'APN729', 'APN730', 'APN731', 'APN732',
               'APN735', 'APN807', 'APN808', 'APN809', 'APN810', 'APN811',
               'APN812', 'APN913', 'APN914', 'APN915', 'APN916', 'APN917',
               'APN918', 'APN919', 'APN920', 'APN921', 'APN922', 'APN923',
               'APN924', 'APN925', 'APN926', 'APN927', 'APN1017', 'APN1018',
               'APN1019', 'APN1020', 'APN1023', 'APN1024', 'APN1025']
    dt.loc[dt['lib_name'].isin(libs_y1), 'y'] = '1'
    dt.loc[dt['lib_name'].isin(libs_y0), 'y'] = '0'

In [4]:
    # Count of each overhang sequence, overhang type flag, and length category.
    dt['overhang_count'] = dt.groupby('overhang_seq')['overhang_seq'].transform('count')
    dt.loc[dt.overhang_type == "5", 'otype'] = 1
    dt.loc[dt.overhang_type == "3", 'otype'] = 0
    dt.loc[dt.overhang_type == "BL", 'otype'] = 0
    dt.loc[dt.full_len < 120, 'len_cat'] = 0
    dt.loc[120 <= dt.full_len, 'len_cat'] = 1

In [5]:
    # Dinucleotide counts within the overhang sequence.
    dinucleotides = ['AA', 'AC', 'AT', 'AG', 'CA', 'CC', 'CT', 'CG',
                     'TA', 'TC', 'TT', 'TG', 'GA', 'GC', 'GT', 'GG']
    for di in dinucleotides:
        dt[di + '_count_oh'] = dt['overhang_seq'].str.count(di)

In [9]:
    dt['y'] = pd.to_numeric(dt['y'])

In [10]:
    # Average per overhang sequence; numeric_only skips the string columns.
    data = dt.groupby('overhang_seq').mean(numeric_only=True)
    data.loc[(data['y'] >= 0.5), 'y'] = 1
    data.loc[(data['y'] < 0.5), 'y'] = 0

In [13]:
    data['perc_len'] = np.log2(data['overhang_length'] / data['full_len'])
    for di in dinucleotides:
        data[di + '_perc'] = (data[di + '_count_oh'] / data['overhang_length']) * 100
    data['overhang_perc'] = np.log2(data['overhang_count'] / len(data.columns))
    data.head()

In [14]:
    data.fillna(0, inplace=True)
    np.any(np.isnan(data))

In [16]:
    len(list(data.keys()))

In [17]:
    X = data.loc[:, data.columns != 'y']
    y = data.loc[:, data.columns == 'y']

In [18]:
    from imblearn.over_sampling import SMOTE
    os = SMOTE(random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y,
        test_size=0.3, random_state=0)
    columns = X_train.columns

In [19]:
    # Oversample the training set only (older imblearn releases named this fit_sample).
    os_data_X, os_data_y = os.fit_resample(X_train, y_train.values.ravel())
    os_data_X = pd.DataFrame(data=os_data_X, columns=columns)
    os_data_y = pd.DataFrame(data=os_data_y, columns=['y'])

In [20]:
    data_final_vars = dt.columns.values.tolist()
    y = ['y']
    X = [i for i in data_final_vars if i not in y]

In [21]:
    logreg = LogisticRegression()

In [22]:
    # Recursive Feature Elimination with the logistic regression estimator.
    from sklearn.feature_selection import RFE
    rfe = RFE(logreg, n_features_to_select=56)
    rfe = rfe.fit(os_data_X, os_data_y.values.ravel())
    print(rfe.support_)
    print(rfe.ranking_)

In [32]:
    # statsmodels import added; the original listing uses sm without importing it.
    # The '_al' columns are not created in the cells shown here.
    import statsmodels.api as sm
    cols = ['overhang_length', 'start', 'end', 'full_len', 'overhang_count',
            'AT_count_oh', 'TA_count_oh', 'TG_count_oh', 'GT_count_oh',
            'AC_count_al', 'AT_count_al', 'AG_count_al', 'CA_count_al',
            'CC_count_al', 'CT_count_al', 'CG_count_al', 'TA_count_al',
            'TC_count_al', 'TT_count_al', 'GA_count_al', 'GC_count_al',
            'perc_len', 'AG_perc', 'TA_perc', 'TG_perc', 'GA_perc',
            'GC_perc', 'GT_perc', 'overhang_perc']
    X = os_data_X[cols]
    y = os_data_y['y']
    logit_model = sm.Logit(y, X)
    result = logit_model.fit()
    print(result.summary2())

In [33]:
    from sklearn.linear_model import LogisticRegression
    from sklearn import metrics
    X_train, X_test, y_train, y_test = train_test_split(X, y,
        test_size=0.3, random_state=0)
    logreg = LogisticRegression()
    logreg.fit(X_train, y_train)

In [34]:
    y_pred = logreg.predict(X_test)
    print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

In [35]:
    import seaborn as sn
    from sklearn.metrics import confusion_matrix
    # Store the result under a new name so the imported function is not shadowed.
    cm = confusion_matrix(y_test, y_pred)
    print(cm)
    plt.figure(figsize=(10, 7))
    sn.heatmap(cm, annot=True)

In [36]:
    from sklearn.metrics import classification_report
    print(classification_report(y_test, y_pred))

In [37]:
    from sklearn.metrics import roc_auc_score
    from sklearn.metrics import roc_curve
    logit_roc_auc = roc_auc_score(y_test, logreg.predict(X_test))
    fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:, 1])
    plt.figure()
    plt.plot(fpr, tpr, label='Logistic Regression (area=%0.2f)' % logit_roc_auc)
    plt.plot([0, 1], [0, 1], 'r--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.savefig('Log_ROC')
    plt.show()

Example 13: Examples of Embodiments

The examples set forth below illustrate certain embodiments and do not limit the technology.

A1. A method for producing a nucleic acid library, comprising:

-   -   (a) combining a nucleic acid composition comprising target        nucleic acids and a plurality of oligonucleotide species,        wherein:        -   (i) each oligonucleotide in the plurality of oligonucleotide            species comprises one strand capable of forming a hairpin            structure having a single-stranded loop, wherein the loop            comprises one or more ribonucleic acid (RNA) nucleotides,        -   (ii) some or all of the target nucleic acids comprise an            overhang,        -   (iii) some or all of the oligonucleotides in the plurality            of oligonucleotide species comprise an overhang capable of            hybridizing to a target nucleic acid overhang, wherein each            oligonucleotide species has a unique overhang sequence and            length,        -   (iv) each oligonucleotide in the plurality of            oligonucleotide species comprises an oligonucleotide            overhang identification sequence specific to one or more            features of the oligonucleotide overhang, and        -   (v) the nucleic acid composition and the plurality of            oligonucleotide species is combined under conditions in            which oligonucleotide overhangs hybridize to target nucleic            acid overhangs having a corresponding length, thereby            forming hybridization products; and    -   (b) contacting the hybridization products under cleavage        conditions with one or more cleavage agents capable of cleaving        the hybridization products within the hairpin loop at the RNA        nucleotide(s), thereby forming cleaved hybridization products.

A2. The method of embodiment A1, wherein each oligonucleotide in the plurality of oligonucleotide species consists of one strand capable of forming a hairpin structure having a single-stranded loop.

A3. The method of embodiment A1 or A2, wherein the loop comprises two RNA nucleotides.

A4. The method of embodiment A1 or A2, wherein the loop comprises three RNA nucleotides.

A5. The method of embodiment A1 or A2, wherein the loop comprises four RNA nucleotides.

A6. The method of any one of embodiments A1 to A5, wherein the loop comprises one or more ribonucleic acid (RNA) nucleotides chosen from adenine (A), cytosine (C), and guanine (G).

A7. The method of any one of embodiments A1 to A6, wherein the RNA nucleotides comprise guanine (G).

A8. The method of any one of embodiments A1 to A6, wherein the RNA nucleotides consist of guanine (G).

A9. The method of any one of embodiments A1 to A6, wherein the RNA nucleotides comprise cytosine (C).

A10. The method of any one of embodiments A1 to A6, wherein the RNA nucleotides consist of cytosine (C).

A11. The method of any one of embodiments A1 to A6, wherein the RNA nucleotides comprise adenine (A).

A12. The method of any one of embodiments A1 to A6, wherein the RNA nucleotides consist of adenine (A).

A13. The method of any one of embodiments A1 to A6, wherein the RNA nucleotides consist of adenine (A), cytosine (C), and guanine (G).

A14. The method of any one of embodiments A1 to A6, wherein the RNA nucleotides consist of adenine (A) and cytosine (C).

A15. The method of any one of embodiments A1 to A6, wherein the RNA nucleotides consist of adenine (A) and guanine (G).

A16. The method of any one of embodiments A1 to A6, wherein the RNA nucleotides consist of cytosine (C) and guanine (G).

A17. The method of any one of embodiments A1 to A16, wherein the plurality of oligonucleotide species comprises oligonucleotides having a 5′ overhang and oligonucleotides having a 3′ overhang.

A18. The method of any one of embodiments A1 to A17, wherein the plurality of oligonucleotide species comprises oligonucleotides having a 5′ overhang, oligonucleotides having a 3′ overhang, and oligonucleotides having no overhang.

A19. The method of any one of embodiments A1 to A18, wherein the oligonucleotides that comprise an overhang comprise a single-stranded loop, a duplex portion, and a single-stranded overhang.

A20. The method of any one of embodiments A1 to A19, wherein the oligonucleotide overhang comprises 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 or 20 nucleotides.

A21. The method of any one of embodiments A1 to A20, wherein the oligonucleotides in the plurality of oligonucleotide species comprise oligonucleotide overhangs having different sequences for a particular overhang length.

A22. The method of embodiment A21, wherein the oligonucleotides in the plurality of oligonucleotide species comprise all possible overhang sequence combinations for a particular overhang length.

A23. The method of embodiment A22, wherein the oligonucleotides in the plurality of oligonucleotide species comprise all possible overhang sequence combinations for each overhang length.

A24. The method of embodiment A21, A22 or A23, wherein the oligonucleotide overhang sequences are random.

A25. The method of any one of embodiments A1 to A18, wherein the oligonucleotides that comprise no overhang comprise a single-stranded loop and a duplex portion.

A26. The method of any one of embodiments A1 to A25, wherein an end of an oligonucleotide is capable of being covalently linked to an end of a target nucleic acid to which the oligonucleotide is hybridized in the hybridization products.

A27. The method of embodiment A26, wherein the 3′ end of an oligonucleotide strand is capable of being covalently linked to the 5′ end of a strand in the target nucleic acid to which the oligonucleotide is hybridized in a hybridization product.

A28. The method of any one of embodiments A1 to A27, wherein the hybridization products comprise a duplex region and at least one single-stranded loop.

A29. The method of any one of embodiments A1 to A28, wherein the hybridization products comprise a duplex region and a single-stranded loop at each end.

A30. The method of embodiment A28 or A29, wherein the one or more cleavage agents are capable of cleaving the hybridization products within the hairpin loop at the RNA nucleotide(s) and are not capable of cleaving the hybridization products within the duplex region.

A31. The method of any one of embodiments A1 to A30, wherein the one or more cleavage agents comprise a ribonuclease (RNAse).

A32. The method of embodiment A31, wherein the RNAse is an endoribonuclease.

A33. The method of embodiment A31 or A32, wherein the RNAse is chosen from one or more of RNAse A, RNAse E, RNAse F, RNAse H, RNAse III, RNAse L, RNAse P, RNAse PhyM, RNAse T1, RNAse T2, RNAse U2, and RNAse V.

A34. The method of any one of embodiments A1 to A33, wherein the oligonucleotide overhang identification sequence is specific to length of the oligonucleotide overhang.

A35. The method of any one of embodiments A1 to A34, wherein the oligonucleotide overhang identification sequence is specific to length of the oligonucleotide overhang and is specific to one or more features of the oligonucleotide overhang chosen from (i) a 5′ overhang, (ii) a 3′ overhang, (iii) a particular sequence, (iv) a combination of (i) and (iii), or (v) a combination of (ii) and (iii).

A36. The method of any one of embodiments A1 to A35, wherein some of the target nucleic acids comprise no overhang.

A37. The method of any one of embodiments A1 to A36, wherein an oligonucleotide species comprises no overhang and comprises an oligonucleotide overhang identification sequence specific to having no overhang.

A38. The method of any one of embodiments A1 to A37, wherein the target nucleic acids comprising an overhang comprise a duplex region and a single-stranded overhang.

A39. The method of any one of embodiments A1 to A38, wherein each target nucleic acid comprising an overhang comprises an overhang at one end or an overhang at both ends.

A40. The method of any one of embodiments A1 to A39, wherein an end, or both ends, of each target nucleic acid comprising an overhang independently comprises a 5′ overhang or a 3′ overhang.

A41. The method of any one of embodiments A1 to A40, wherein the target nucleic acids comprise deoxyribonucleic acid (DNA) fragments.

A42. The method of embodiment A41, wherein the DNA fragments are obtained from cells.

A43. The method of embodiment A41 or A42, wherein the DNA fragments comprise genomic DNA fragments.

A44. The method of any one of embodiments A1 to A40, wherein the target nucleic acids comprise ribonucleic acid (RNA) fragments.

A45. The method of embodiment A44, wherein the RNA fragments are obtained from cells.

A46. The method of any one of embodiments A1 to A45, wherein the target nucleic acids comprise cell-free nucleic acid fragments.

A47. The method of any one of embodiments A1 to A46, wherein the target nucleic acids comprise circulating cell-free nucleic acid fragments.

A48. The method of any one of embodiments A1 to A47, wherein the overhangs in target nucleic acids are native overhangs.

A49. The method of any one of embodiments A1 to A48, wherein the overhangs in target nucleic acids are unmodified overhangs.

A50. The method of any one of embodiments A1 to A49, wherein the target nucleic acids are not modified in length prior to combining with the plurality of oligonucleotide species.

A51. The method of any one of embodiments A1 to A50, comprising preparing the nucleic acid composition prior to (a), by a process consisting essentially of isolating nucleic acid from a sample, thereby generating the nucleic acid composition.

A52. The method of any one of embodiments A1 to A51, comprising exposing the hybridization products to conditions under which an end of the target nucleic acid is joined to an end of the oligonucleotide to which it is hybridized.

A53. The method of embodiment A52, comprising contacting the hybridization products with an agent comprising a ligase activity under conditions in which an end of a target nucleic acid is covalently linked to an end of the oligonucleotide to which the target nucleic acid is hybridized.

A54. The method of any one of embodiments A1 to A53, comprising prior to (a), contacting the target nucleic acid composition with an agent comprising a phosphatase activity under conditions in which target nucleic acids are dephosphorylated, thereby generating a dephosphorylated target nucleic acid composition.

A55. The method of embodiment A54, comprising prior to (a), contacting the dephosphorylated target nucleic acid composition with an agent comprising a phosphoryl transfer activity under conditions in which a 5′ phosphate is added to a 5′ end of target nucleic acids.

A56. The method of any one of embodiments A1 to A55, comprising prior to (a), contacting the plurality of oligonucleotide species with an agent comprising a phosphatase activity under conditions in which the oligonucleotides are dephosphorylated, thereby generating a plurality of dephosphorylated oligonucleotide species.

A57. The method of embodiment A56, comprising prior to (a), contacting the dephosphorylated oligonucleotide species with an agent comprising a phosphoryl transfer activity under conditions in which a 5′ phosphate is added to a 5′ end of oligonucleotide species.

A58. The method of any one of embodiments A1 to A57, wherein the target nucleic acids are obtained from a sample from a subject.

A59. The method of embodiment A58, wherein the subject is a human.

A60. The method of any one of embodiments A1 to A59, comprising prior to(a), separating the target nucleic acids according to fragment length.

A61. The method of embodiment A60, wherein target nucleic acids havingfragment lengths of less than about 500 bp are combined with theplurality of oligonucleotide species.

A62. The method of embodiment A60, wherein target nucleic acids havingfragment lengths of about 500 bp or more are combined with the pluralityof oligonucleotide species.

A63. The method of any one of embodiments A1 to A62, wherein the oligonucleotide overhang comprises DNA nucleotides.

A64. The method of any one of embodiments A1 to A62, wherein the oligonucleotide overhang consists of DNA nucleotides.

A65. The method of any one of embodiments A1 to A62, wherein the oligonucleotide overhang comprises RNA nucleotides.

A66. The method of any one of embodiments A1 to A62, wherein the oligonucleotide overhang consists of RNA nucleotides.

A67. The method of embodiment A65 or A66, comprising contacting the hybridization products with an agent comprising an RNA ligase activity under conditions in which an end of a target nucleic acid is covalently linked to an end of the oligonucleotide to which the target nucleic acid is hybridized.

A68. The method of any one of embodiments A65 to A67, comprising contacting the hybridization products with an agent comprising an RNAse activity under conditions in which double-stranded RNA duplexes are digested.

B1. A composition comprising a plurality of oligonucleotide species, wherein:

(a) each oligonucleotide in the plurality of oligonucleotide species comprises one strand capable of forming a hairpin structure having a single-stranded loop, wherein the loop comprises one or more ribonucleic acid (RNA) nucleotides;
(b) some or all of the oligonucleotides in the plurality of oligonucleotide species comprise an overhang capable of hybridizing to an overhang in a target nucleic acid, wherein each oligonucleotide species has a unique overhang sequence and length; and
(c) each oligonucleotide in the plurality of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the oligonucleotide overhang.

B2. The composition of embodiment B1, wherein each oligonucleotide in the plurality of oligonucleotide species consists of one strand capable of forming a hairpin structure having a single-stranded loop.

B3. The composition of embodiment B1 or B2, wherein the loop comprises two RNA nucleotides.

B4. The composition of embodiment B1 or B2, wherein the loop comprises three RNA nucleotides.

B5. The composition of embodiment B1 or B2, wherein the loop comprises four RNA nucleotides.

B6. The composition of any one of embodiments B1 to B5, wherein the loop comprises one or more ribonucleic acid (RNA) nucleotides chosen from adenine (A), cytosine (C), and guanine (G).

B7. The composition of any one of embodiments B1 to B6, wherein the RNA nucleotides comprise guanine (G).

B8. The composition of any one of embodiments B1 to B6, wherein the RNA nucleotides consist of guanine (G).

B9. The composition of any one of embodiments B1 to B6, wherein the RNA nucleotides comprise cytosine (C).

B10. The composition of any one of embodiments B1 to B6, wherein the RNA nucleotides consist of cytosine (C).

B11. The composition of any one of embodiments B1 to B6, wherein the RNA nucleotides comprise adenine (A).

B12. The composition of any one of embodiments B1 to B6, wherein the RNA nucleotides consist of adenine (A).

B13. The composition of any one of embodiments B1 to B6, wherein the RNA nucleotides consist of adenine (A), cytosine (C), and guanine (G).

B14. The composition of any one of embodiments B1 to B6, wherein the RNA nucleotides consist of adenine (A) and cytosine (C).

B15. The composition of any one of embodiments B1 to B6, wherein the RNA nucleotides consist of adenine (A) and guanine (G).

B16. The composition of any one of embodiments B1 to B6, wherein the RNA nucleotides consist of cytosine (C) and guanine (G).

B17. The composition of any one of embodiments B1 to B16, wherein the plurality of oligonucleotide species comprises oligonucleotides having a 5′ overhang and oligonucleotides having a 3′ overhang.

B18. The composition of any one of embodiments B1 to B17, wherein the plurality of oligonucleotide species comprises oligonucleotides having a 5′ overhang, oligonucleotides having a 3′ overhang, and oligonucleotides having no overhang.

B19. The composition of any one of embodiments B1 to B18, wherein the oligonucleotides that comprise an overhang comprise a single-stranded loop, a duplex portion, and a single-stranded overhang.

B20. The composition of any one of embodiments B1 to B19, wherein the oligonucleotide overhang comprises 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 or 20 nucleotides.

B21. The composition of any one of embodiments B1 to B20, wherein the oligonucleotides in the plurality of oligonucleotide species comprise oligonucleotide overhangs having different sequences for a particular overhang length.

B22. The composition of embodiment B21, wherein the oligonucleotides in the plurality of oligonucleotide species comprise all possible overhang sequence combinations for a particular overhang length.

B23. The composition of embodiment B22, wherein the oligonucleotides in the plurality of oligonucleotide species comprise all possible overhang sequence combinations for each overhang length.

B24. The composition of embodiment B21, B22 or B23, wherein the oligonucleotide overhang sequences are random.
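
To make "all possible overhang sequence combinations" in embodiments B22 and B23 concrete, the following sketch enumerates every sequence over a four-letter DNA alphabet for each overhang length up to a chosen maximum. The names `all_overhang_sequences` and `overhang_pool`, and the restriction to the letters A, C, G and T, are assumptions made for illustration and are not taken from the embodiments.

```python
from itertools import product
from typing import Dict, List

# Assumed alphabet for a DNA overhang; an RNA overhang would use "ACGU".
DNA_ALPHABET = "ACGT"


def all_overhang_sequences(length: int, alphabet: str = DNA_ALPHABET) -> List[str]:
    """Return every sequence of the given length over the alphabet."""
    if length < 1:
        raise ValueError("overhang length must be at least 1")
    return ["".join(letters) for letters in product(alphabet, repeat=length)]


def overhang_pool(max_length: int) -> Dict[int, List[str]]:
    """Map each overhang length from 1 to max_length to all of its sequences."""
    return {n: all_overhang_sequences(n) for n in range(1, max_length + 1)}


pool = overhang_pool(3)
counts = {n: len(seqs) for n, seqs in pool.items()}
```

The number of sequences for a length n is 4^n, so explicit enumeration of this kind is practical only for short overhangs.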

B25. The composition of any one of embodiments B1 to B18, wherein the oligonucleotides that comprise no overhang comprise a single-stranded loop and a duplex portion.

B26. The composition of any one of embodiments B1 to B25, wherein the oligonucleotide overhang identification sequence is specific to length of the oligonucleotide overhang.

B27. The composition of any one of embodiments B1 to B26, wherein the oligonucleotide overhang identification sequence is specific to length of the oligonucleotide overhang and is specific to one or more features of the oligonucleotide overhang chosen from (i) a 5′ overhang, (ii) a 3′ overhang, (iii) a particular sequence, (iv) a combination of (i) and (iii), or (v) a combination of (ii) and (iii).

B28. The composition of any one of embodiments B1 to B27, wherein an oligonucleotide species comprises no overhang and comprises an oligonucleotide overhang identification sequence specific to having no overhang.
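
One way to picture an oligonucleotide overhang identification sequence as described in embodiments B26 to B28 is as a lookup key that reports the overhang type and length, including a dedicated entry for having no overhang. The sketch below is a minimal illustration under stated assumptions: the six-nucleotide barcodes in `ID_TABLE`, the fixed barcode position within a read, and the names `OverhangFeatures` and `decode_overhang` are all hypothetical.

```python
from typing import Dict, NamedTuple, Optional


class OverhangFeatures(NamedTuple):
    kind: str    # "5prime", "3prime", or "blunt" (no overhang)
    length: int  # overhang length in nucleotides; 0 for no overhang


# Hypothetical identification sequences, invented for illustration only.
ID_TABLE: Dict[str, OverhangFeatures] = {
    "ACGTAC": OverhangFeatures("blunt", 0),
    "TGCATG": OverhangFeatures("5prime", 1),
    "GATCGA": OverhangFeatures("5prime", 2),
    "CTAGCT": OverhangFeatures("3prime", 1),
    "AGTCAG": OverhangFeatures("3prime", 2),
}


def decode_overhang(
    read: str, barcode_start: int = 0, barcode_len: int = 6
) -> Optional[OverhangFeatures]:
    """Look up the overhang features encoded at an assumed fixed read position.

    Returns None when the extracted barcode is not in the table.
    """
    barcode = read[barcode_start:barcode_start + barcode_len]
    return ID_TABLE.get(barcode)


features = decode_overhang("GATCGATTTTCCCC")
```

A table keyed on type and length alone corresponds to features (i) and (ii) of embodiment B27; encoding a particular overhang sequence as well would need one entry per sequence.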

B29. The composition of any one of embodiments B1 to B28, wherein the oligonucleotide overhang comprises DNA nucleotides.

B30. The composition of any one of embodiments B1 to B28, wherein the oligonucleotide overhang consists of DNA nucleotides.

B31. The composition of any one of embodiments B1 to B28, wherein the oligonucleotide overhang comprises RNA nucleotides.

B32. The composition of any one of embodiments B1 to B28, wherein the oligonucleotide overhang consists of RNA nucleotides.

C1. A method for modifying nucleic acid ends, comprising:

(a) combining a nucleic acid composition comprising target nucleic acids and a plurality of oligonucleotide species, wherein:
    (i) each oligonucleotide in the plurality of oligonucleotide species comprises one or more cleavage sites capable of being cleaved under cleavage conditions,
    (ii) some or all of the target nucleic acids comprise an overhang,
    (iii) some or all of the oligonucleotides in the plurality of oligonucleotide species comprise two strands and a first overhang and a second overhang, wherein each overhang is capable of hybridizing to a target nucleic acid overhang, wherein each oligonucleotide species has a unique overhang sequence and length,
    (iv) each oligonucleotide in the plurality of oligonucleotide species comprises at least two oligonucleotide overhang identification sequences specific to one or more features of the first and second oligonucleotide overhangs, and
    (v) the nucleic acid composition and the plurality of oligonucleotide species are combined under conditions in which oligonucleotide overhangs hybridize to target nucleic acid overhangs having a corresponding length, thereby forming hybridization products;
(b) contacting the hybridization products under cleavage conditions with one or more cleavage agents capable of cleaving the hybridization products at the one or more cleavage sites, thereby forming cleaved hybridization products; and
(c) contacting the cleaved hybridization products with a strand-displacing polymerase, thereby forming blunt-ended nucleic acid fragments.
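
The hybridization condition in step (a)(v), in which an oligonucleotide overhang pairs with a target overhang of corresponding length, can be sketched as a simple compatibility test. The sketch assumes the usual cohesive-end picture: both overhangs are of the same type (both 5′ or both 3′), have equal length, and are reverse complements when each is written 5′ to 3′. The function names and the DNA-only alphabet are assumptions made for illustration.

```python
# Translation table for complementing DNA bases (assumed alphabet: A, C, G, T).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def overhangs_can_hybridize(
    oligo_kind: str, oligo_seq: str, target_kind: str, target_seq: str
) -> bool:
    """Check type, length, and full complementarity of two overhangs.

    kind is "5prime" or "3prime"; sequences are written 5'->3'.
    """
    return (
        oligo_kind == target_kind
        and len(oligo_seq) == len(target_seq)
        and oligo_seq == reverse_complement(target_seq)
    )


match = overhangs_can_hybridize("5prime", "CGT", "5prime", "ACG")
wrong_type = overhangs_can_hybridize("5prime", "CGT", "3prime", "ACG")
wrong_length = overhangs_can_hybridize("5prime", "GT", "5prime", "ACG")
```

This test requires perfect complementarity; it does not model partial or mismatched hybridization.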

C2. The method of embodiment C1, wherein (c) comprises contacting the cleaved hybridization products with a strand-displacing polymerase and modified nucleotides, thereby forming blunt-ended nucleic acid fragments comprising one or more modified nucleotides.

C3. The method of embodiment C2, wherein the one or more modified nucleotides comprise a nucleotide conjugated to a member of a binding pair.

C4. The method of embodiment C2, wherein the one or more modified nucleotides comprise a nucleotide conjugated to biotin.

C5. The method of any one of embodiments C1 to C4, wherein the one or more cleavage sites comprise nucleotides chosen from uracil and deoxyuridine.

C6. The method of any one of embodiments C1 to C5, wherein the one or more cleavage agents comprise an endonuclease.

C7. The method of any one of embodiments C1 to C5, wherein the one or more cleavage agents comprise a DNA glycosylase.

C8. The method of any one of embodiments C1 to C7, wherein the one or more cleavage agents comprise an endonuclease and a DNA glycosylase.

C9. The method of embodiment C8, wherein the one or more cleavage agents comprise a mixture of uracil DNA glycosylase (UDG) and endonuclease VIII.
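
The effect of a UDG and endonuclease VIII mixture on a strand containing deoxyuridine cleavage sites (embodiments C5 and C9) can be approximated on a sequence string. UDG removes the uracil base to leave an abasic site, and endonuclease VIII breaks the backbone on both sides of that site, so the strand separates into pieces with the site itself lost. The sketch below models only this outcome on a single strand; the function name and the use of the letter "U" to mark a cleavage site are assumptions made for illustration.

```python
from typing import List


def cleave_at_deoxyuridine(strand: str, site: str = "U") -> List[str]:
    """Split a strand at every cleavage-site character, dropping the site.

    Empty pieces (from a site at either end, or adjacent sites) are discarded.
    """
    return [piece for piece in strand.split(site) if piece]


# Hypothetical strand with two deoxyuridine sites, for illustration only.
pieces = cleave_at_deoxyuridine("ACGTUGGCCUTTAA")
```

The sketch ignores the complementary strand and the end chemistry left at each break.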

C10. The method of any one of embodiments C1 to C4, wherein the one or more cleavage sites comprise a restriction enzyme recognition site.

C11. The method of embodiment C10, wherein the one or more cleavage agents comprise a restriction enzyme.

C12. The method of embodiment C10, wherein the one or more cleavage agents comprise a rare-cutter restriction enzyme.

C13. Reserved.

C14. Reserved.

C15. The method of any one of embodiments C1 to C4, wherein the one or more cleavage sites comprise one or more RNA nucleotides.

C16. The method of any one of embodiments C1 to C4, wherein the one or more cleavage sites comprise a single stranded portion comprising one or more RNA nucleotides.

C17. The method of embodiment C15 or C16, wherein the one or more cleavage agents comprise a ribonuclease (RNAse).

C18. The method of embodiment C17, wherein the RNAse is an endoribonuclease.

C19. The method of embodiment C17 or C18, wherein the RNAse is chosen from one or more of RNAse A, RNAse E, RNAse F, RNAse H, RNAse III, RNAse L, RNAse P, RNAse PhyM, RNAse T1, RNAse T2, RNAse U2, and RNAse V.

C20. The method of any one of embodiments C1 to C4, wherein the one or more cleavage sites comprise a photo-cleavable spacer.

C21. The method of embodiment C20, wherein the one or more cleavage agents comprise ultraviolet (UV) light.

C22. The method of any one of embodiments C1 to C21, wherein the plurality of oligonucleotide species comprises oligonucleotides having a 5′ overhang and oligonucleotides having a 3′ overhang.

C23. The method of any one of embodiments C1 to C22, wherein the plurality of oligonucleotide species comprises oligonucleotides having a 5′ overhang, oligonucleotides having a 3′ overhang, and oligonucleotides having no overhang.

C24. The method of any one of embodiments C1 to C23, wherein the oligonucleotides that comprise an overhang comprise a duplex portion, and a single-stranded overhang on each end.

C25. The method of any one of embodiments C1 to C24, wherein the oligonucleotides that comprise an overhang comprise a duplex portion, and a single-stranded overhang on each end, wherein the single-stranded overhang on the first end is identical in length and identical in sequence to the overhang on the second end.

C26. The method of any one of embodiments C1 to C25, wherein the oligonucleotide overhang comprises 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 or 20 nucleotides.

C27. The method of any one of embodiments C1 to C26, wherein the oligonucleotides in the plurality of oligonucleotide species comprise oligonucleotide overhangs having different sequences for a particular overhang length.

C28. The method of embodiment C27, wherein the oligonucleotides in the plurality of oligonucleotide species comprise all possible overhang sequence combinations for a particular overhang length.

C29. The method of embodiment C28, wherein the oligonucleotides in the plurality of oligonucleotide species comprise all possible overhang sequence combinations for each overhang length.

C30. The method of embodiment C27, C28 or C29, wherein the oligonucleotide overhang sequences are random.

C31. The method of any one of embodiments C1 to C30, wherein the oligonucleotides that comprise no overhang comprise a dual blunt-ended duplex portion.

C32. The method of any one of embodiments C1 to C31, wherein an end of an oligonucleotide is capable of being covalently linked to an end of a target nucleic acid to which the oligonucleotide is hybridized in the hybridization products.

C33. The method of embodiment C32, wherein the 3′ end of an oligonucleotide strand is capable of being covalently linked to the 5′ end of a strand in the target nucleic acid to which the oligonucleotide is hybridized in a hybridization product.

C34. The method of any one of embodiments C1 to C33, wherein the oligonucleotide overhang identification sequence is specific to length of the oligonucleotide overhang.

C35. The method of any one of embodiments C1 to C34, wherein the oligonucleotide overhang identification sequence is specific to length of the oligonucleotide overhang and is specific to one or more features of the oligonucleotide overhang chosen from (i) a 5′ overhang, (ii) a 3′ overhang, (iii) a particular sequence, (iv) a combination of (i) and (iii), or (v) a combination of (ii) and (iii).

C36. The method of any one of embodiments C1 to C35, wherein some of the target nucleic acids comprise no overhang.

C37. The method of any one of embodiments C1 to C36, wherein an oligonucleotide species comprises no overhang and comprises an oligonucleotide overhang identification sequence specific to having no overhang.

C38. The method of any one of embodiments C1 to C37, wherein the target nucleic acids comprising an overhang comprise a duplex region and a single-stranded overhang.

C39. The method of any one of embodiments C1 to C38, wherein each target nucleic acid comprising an overhang comprises an overhang at one end or an overhang at both ends.

C40. The method of any one of embodiments C1 to C39, wherein an end, or both ends, of each target nucleic acid comprising an overhang independently comprises a 5′ overhang or a 3′ overhang.

C41. The method of any one of embodiments C1 to C40, wherein the target nucleic acids comprise deoxyribonucleic acid (DNA) fragments.

C42. The method of embodiment C41, wherein the DNA fragments are obtained from cells.

C43. The method of embodiment C41 or C42, wherein the DNA fragments comprise genomic DNA fragments.

C44. The method of any one of embodiments C1 to C40, wherein the target nucleic acids comprise ribonucleic acid (RNA) fragments.

C45. The method of embodiment C44, wherein the RNA fragments are obtained from cells.

C46. The method of any one of embodiments C1 to C45, wherein the target nucleic acids comprise cell-free nucleic acid fragments.

C47. The method of any one of embodiments C1 to C46, wherein the target nucleic acids comprise circulating cell-free nucleic acid fragments.

C48. The method of any one of embodiments C1 to C47, wherein the overhangs in target nucleic acids are native overhangs.

C49. The method of any one of embodiments C1 to C48, wherein the overhangs in target nucleic acids are unmodified overhangs.

C50. The method of any one of embodiments C1 to C49, wherein the target nucleic acids are not modified in length prior to combining with the plurality of oligonucleotide species.

C51. The method of any one of embodiments C1 to C50, comprising preparing the nucleic acid composition prior to (a), by a process consisting essentially of isolating nucleic acid from a sample, thereby generating the nucleic acid composition.

C52. The method of any one of embodiments C1 to C51, comprising exposing the hybridization products to conditions under which an end of the target nucleic acid is joined to an end of the oligonucleotide to which it is hybridized.

C53. The method of embodiment C52, comprising contacting the hybridization products with an agent comprising a ligase activity under conditions in which an end of a target nucleic acid is covalently linked to an end of the oligonucleotide to which the target nucleic acid is hybridized.

C54. The method of any one of embodiments C1 to C53, comprising prior to (a), contacting the target nucleic acid composition with an agent comprising a phosphatase activity under conditions in which target nucleic acids are dephosphorylated, thereby generating a dephosphorylated target nucleic acid composition.

C55. The method of embodiment C54, comprising prior to (a), contacting the dephosphorylated target nucleic acid composition with an agent comprising a phosphoryl transfer activity under conditions in which a 5′ phosphate is added to a 5′ end of target nucleic acids.

C56. The method of any one of embodiments C1 to C55, comprising prior to (a), contacting the plurality of oligonucleotide species with an agent comprising a phosphatase activity under conditions in which the oligonucleotides are dephosphorylated, thereby generating a plurality of dephosphorylated oligonucleotide species.

C57. The method of embodiment C56, comprising prior to (a), contacting the dephosphorylated oligonucleotide species with an agent comprising a phosphoryl transfer activity under conditions in which a 5′ phosphate is added to a 5′ end of oligonucleotide species.

C58. The method of any one of embodiments C1 to C57, wherein the target nucleic acids are obtained from a sample from a subject.

C59. The method of embodiment C58, wherein the subject is a human.

C60. The method of any one of embodiments C1 to C59, comprising prior to (a), separating the target nucleic acids according to fragment length.

C61. The method of embodiment C60, wherein target nucleic acids having fragment lengths of less than about 500 bp are combined with the plurality of oligonucleotide species.

C62. The method of embodiment C60, wherein target nucleic acids having fragment lengths of about 500 bp or more are combined with the plurality of oligonucleotide species.

C63. The method of any one of embodiments C1 to C62, wherein the oligonucleotide overhang comprises DNA nucleotides.

C64. The method of any one of embodiments C1 to C62, wherein the oligonucleotide overhang consists of DNA nucleotides.

C65. The method of any one of embodiments C1 to C62, wherein the oligonucleotide overhang comprises RNA nucleotides.

C66. The method of any one of embodiments C1 to C62, wherein the oligonucleotide overhang consists of RNA nucleotides.

C67. The method of embodiment C65 or C66, comprising contacting the hybridization products with an agent comprising an RNA ligase activity under conditions in which an end of a target nucleic acid is covalently linked to an end of the oligonucleotide to which the target nucleic acid is hybridized.

C68. The method of any one of embodiments C65 to C67, comprising contacting the hybridization products with an agent comprising an RNAse activity under conditions in which double-stranded RNA duplexes are digested.

D1. A composition comprising a plurality of oligonucleotide species, wherein:

(a) each oligonucleotide in the plurality of oligonucleotide species comprises one or more cleavage sites capable of being cleaved under cleavage conditions;
(b) some or all of the oligonucleotides in the plurality of oligonucleotide species comprise two strands and a first overhang and a second overhang, wherein each overhang is capable of hybridizing to a target nucleic acid overhang, wherein each oligonucleotide species has a unique overhang sequence and length; and
(c) each oligonucleotide in the plurality of oligonucleotide species comprises at least two oligonucleotide overhang identification sequences specific to one or more features of the first and second oligonucleotide overhangs.
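
The elements recited in (a) to (c) above can be summarized as a small record type, which may help when tracking oligonucleotide species in software. The sketch below is one possible representation under stated assumptions: the class names `Overhang` and `DualOverhangOligo`, the representation of cleavage sites as integer positions, and the example sequences are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Overhang:
    kind: str      # "5prime" or "3prime"
    sequence: str  # written 5'->3'


@dataclass(frozen=True)
class DualOverhangOligo:
    first_overhang: Overhang
    second_overhang: Overhang
    id_sequences: Tuple[str, ...]    # overhang identification sequences
    cleavage_sites: Tuple[int, ...]  # assumed: positions along the duplex

    def __post_init__(self) -> None:
        # Element (c): at least two identification sequences.
        if len(self.id_sequences) < 2:
            raise ValueError("at least two identification sequences are required")
        # Element (a): one or more cleavage sites.
        if len(self.cleavage_sites) < 1:
            raise ValueError("at least one cleavage site is required")

    def is_symmetric(self) -> bool:
        """True when both overhangs match in type, length, and sequence."""
        return self.first_overhang == self.second_overhang


# Hypothetical species with identical 5' overhangs on both ends.
oligo = DualOverhangOligo(
    first_overhang=Overhang("5prime", "ACG"),
    second_overhang=Overhang("5prime", "ACG"),
    id_sequences=("TGCATG", "TGCATG"),
    cleavage_sites=(12,),
)
```

The `is_symmetric` check corresponds to the case in which the overhang on the first end is identical in length and sequence to the overhang on the second end.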

D2. The composition of embodiment D1, wherein the one or more cleavage sites comprise nucleotides chosen from uracil and deoxyuridine.

D3. The composition of embodiment D1, wherein the one or more cleavage sites comprise a restriction enzyme recognition site.

D4. The composition of embodiment D1, wherein the one or more cleavage sites comprise one or more RNA nucleotides.

D5. The composition of embodiment D4, wherein the one or more cleavage sites comprise a single stranded portion comprising one or more RNA nucleotides.

D6. The composition of embodiment D1, wherein the one or more cleavage sites comprise a photo-cleavable spacer.

D7. The composition of any one of embodiments D1 to D6, wherein the plurality of oligonucleotide species comprises oligonucleotides having a 5′ overhang and oligonucleotides having a 3′ overhang.

D8. The composition of any one of embodiments D1 to D7, wherein the plurality of oligonucleotide species comprises oligonucleotides having a 5′ overhang, oligonucleotides having a 3′ overhang, and oligonucleotides having no overhang.

D9. The composition of any one of embodiments D1 to D8, wherein the oligonucleotides that comprise an overhang comprise a duplex portion, and a single-stranded overhang on each end.

D10. The composition of any one of embodiments D1 to D9, wherein the oligonucleotides that comprise an overhang comprise a duplex portion, and a single-stranded overhang on each end, wherein the single-stranded overhang on the first end is identical in length and identical in sequence to the overhang on the second end.

D11. The composition of any one of embodiments D1 to D10, wherein the oligonucleotide overhang comprises 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 or 20 nucleotides.

D12. The composition of any one of embodiments D1 to D11, wherein the oligonucleotides in the plurality of oligonucleotide species comprise oligonucleotide overhangs having different sequences for a particular overhang length.

D13. The composition of embodiment D12, wherein the oligonucleotides in the plurality of oligonucleotide species comprise all possible overhang sequence combinations for a particular overhang length.

D14. The composition of embodiment D13, wherein the oligonucleotides in the plurality of oligonucleotide species comprise all possible overhang sequence combinations for each overhang length.

D15. The composition of embodiment D12, D13 or D14, wherein the oligonucleotide overhang sequences are random.

D16. The composition of any one of embodiments D1 to D15, wherein the oligonucleotides that comprise no overhang comprise a dual blunt-ended duplex portion.

D17. The composition of any one of embodiments D1 to D16, wherein the oligonucleotide overhang identification sequence is specific to length of the oligonucleotide overhang.

D18. The composition of any one of embodiments D1 to D17, wherein the oligonucleotide overhang identification sequence is specific to length of the oligonucleotide overhang and is specific to one or more features of the oligonucleotide overhang chosen from (i) a 5′ overhang, (ii) a 3′ overhang, (iii) a particular sequence, (iv) a combination of (i) and (iii), or (v) a combination of (ii) and (iii).

D19. The composition of any one of embodiments D1 to D18, wherein an oligonucleotide species comprises no overhang and comprises an oligonucleotide overhang identification sequence specific to having no overhang.

D20. The composition of any one of embodiments D1 to D19, wherein the oligonucleotide overhang comprises DNA nucleotides.

D21. The composition of any one of embodiments D1 to D19, wherein the oligonucleotide overhang consists of DNA nucleotides.

D22. The composition of any one of embodiments D1 to D19, wherein the oligonucleotide overhang comprises RNA nucleotides.

D23. The composition of any one of embodiments D1 to D19, wherein the oligonucleotide overhang consists of RNA nucleotides.

E1. A method for modifying nucleic acid ends, comprising:

(a) combining a nucleic acid composition comprising target nucleic acids and a plurality of oligonucleotide species, wherein:
    (i) some or all of the oligonucleotides in the plurality of oligonucleotide species comprise two strands and an overhang at a first end and one or more modified nucleotides at a second end, wherein the overhang is capable of hybridizing to a target nucleic acid overhang, wherein each oligonucleotide species has a unique overhang sequence and length,
    (ii) some or all of the target nucleic acids comprise an overhang,
    (iii) each oligonucleotide in the plurality of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the oligonucleotide overhang, and
    (iv) the nucleic acid composition and the plurality of oligonucleotide species are combined under conditions in which oligonucleotide overhangs hybridize to target nucleic acid overhangs having a corresponding length, thereby forming hybridization products; and
(b) contacting the hybridization products with a strand-displacing polymerase, thereby forming blunt-ended nucleic acid fragments.

E2. The method of embodiment E1, wherein the oligonucleotides having one or more modified nucleotides at a second end comprise an unpaired modified nucleotide at the second end.

E3. The method of embodiment E1 or E2, wherein the oligonucleotides having one or more modified nucleotides at a second end comprise the one or more modified nucleotides at the end of the strand having a 3′ terminus.

E4. The method of embodiment E1 or E2, wherein the oligonucleotides having one or more modified nucleotides at a second end comprise the one or more modified nucleotides at the end of the strand having a 5′ terminus.

E5. The method of any one of embodiments E1 to E4, wherein the one or more modified nucleotides are capable of blocking hybridization to a nucleotide in a target nucleic acid.

E6. The method of any one of embodiments E1 to E5, wherein the one or more modified nucleotides are capable of blocking ligation to a nucleotide in a target nucleic acid.

E7. The method of any one of embodiments E1 to E6, wherein the one or more modified nucleotides comprise a modified nucleotide incapable of binding to a natural nucleotide.

E8. The method of any one of embodiments E1 to E7, wherein the one or more modified nucleotides comprise one or more modified nucleotides chosen from an isodeoxy-base, a dideoxy-base, an inverted dideoxy-base, a spacer, and an amino linker.

E9. The method of any one of embodiments E1 to E8, wherein the one or more modified nucleotides comprise an isodeoxy-base.

E10. The method of embodiment E9, wherein the one or more modified nucleotides comprise isodeoxy-guanine (iso-dG).

E11. The method of embodiment E10, wherein the one or more modified nucleotides comprise isodeoxy-cytosine (iso-dC).

E12. The method of any one of embodiments E1 to E8, wherein the one or more modified nucleotides comprise a dideoxy-base.

E13. The method of embodiment E12, wherein the one or more modified nucleotides comprise dideoxy-cytosine.

E14. The method of any one of embodiments E1 to E8, wherein the one or more modified nucleotides comprise an inverted dideoxy-base.

E15. The method of embodiment E14, wherein the one or more modified nucleotides comprise inverted dideoxy-thymine.

E16. The method of any one of embodiments E1 to E8, wherein the one or more modified nucleotides comprise a spacer.

E17. The method of embodiment E16, wherein the one or more modified nucleotides comprise a C3 spacer.

E18. The method of any one of embodiments E1 to E17, wherein the blunt-ended nucleic acid fragments formed in (b) comprise no modified nucleotides.

E19. The method of any one of embodiments E1 to E18, wherein the plurality of oligonucleotide species comprises oligonucleotides having a 5′ overhang and oligonucleotides having a 3′ overhang.

E20. The method of any one of embodiments E1 to E19, wherein the plurality of oligonucleotide species comprises oligonucleotides having a 5′ overhang, oligonucleotides having a 3′ overhang, and oligonucleotides having no overhang.

E21. The method of any one of embodiments E1 to E20, wherein the oligonucleotides that comprise an overhang comprise a duplex portion, an overhang at the first end and at least one unpaired modified nucleotide at the second end.

E22. The method of any one of embodiments E1 to E21, wherein the oligonucleotide overhang comprises 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 or 20 nucleotides.

E23. The method of any one of embodiments E1 to E22, wherein the oligonucleotides in the plurality of oligonucleotide species comprise oligonucleotide overhangs having different sequences for a particular overhang length.

E24. The method of embodiment E23, wherein the oligonucleotides in the plurality of oligonucleotide species comprise all possible overhang sequence combinations for a particular overhang length.

E25. The method of embodiment E24, wherein the oligonucleotides in the plurality of oligonucleotide species comprise all possible overhang sequence combinations for each overhang length.

E26. The method of embodiment E23, E24 or E25, wherein the oligonucleotide overhang sequences are random.

E27. The method of any one of embodiments E1 to E26, wherein the oligonucleotides that comprise no overhang comprise a duplex portion having a blunt end at a first end and at least one unpaired modified nucleotide at a second end.

E28. The method of any one of embodiments E1 to E27, wherein an end of an oligonucleotide is capable of being covalently linked to an end of a target nucleic acid to which the oligonucleotide is hybridized in the hybridization products.

E29. The method of embodiment E28, wherein the 3′ end of an oligonucleotide strand is capable of being covalently linked to the 5′ end of a strand in the target nucleic acid to which the oligonucleotide is hybridized in a hybridization product.

E30. The method of any one of embodiments E1 to E29, wherein the oligonucleotide overhang identification sequence is specific to length of the oligonucleotide overhang.

E31. The method of any one of embodiments E1 to E30, wherein the oligonucleotide overhang identification sequence is specific to length of the oligonucleotide overhang and is specific to one or more features of the oligonucleotide overhang chosen from (i) a 5′ overhang, (ii) a 3′ overhang, (iii) a particular sequence, (iv) a combination of (i) and (iii), or (v) a combination of (ii) and (iii).

E32. The method of any one of embodiments E1 to E31, wherein some of the target nucleic acids comprise no overhang.

E33. The method of any one of embodiments E1 to E32, wherein an oligonucleotide species comprises no overhang and comprises an oligonucleotide overhang identification sequence specific to having no overhang.

E34. The method of any one of embodiments E1 to E33, wherein the target nucleic acids comprising an overhang comprise a duplex region and a single-stranded overhang.

E35. The method of any one of embodiments E1 to E34, wherein each target nucleic acid comprising an overhang comprises an overhang at one end or an overhang at both ends.

E36. The method of any one of embodiments E1 to E35, wherein an end, or both ends, of each target nucleic acid comprising an overhang independently comprises a 5′ overhang or a 3′ overhang.

E37. The method of any one of embodiments E1 to E36, wherein the target nucleic acids comprise deoxyribonucleic acid (DNA) fragments.

E38. The method of embodiment E37, wherein the DNA fragments are obtained from cells.

E39. The method of embodiment E37 or E38, wherein the DNA fragments comprise genomic DNA fragments.

E40. The method of any one of embodiments E1 to E36, wherein the target nucleic acids comprise ribonucleic acid (RNA) fragments.

E41. The method of embodiment E40, wherein the RNA fragments are obtained from cells.

E42. The method of any one of embodiments E1 to E41, wherein the target nucleic acids comprise cell-free nucleic acid fragments.

E43. The method of any one of embodiments E1 to E42, wherein the target nucleic acids comprise circulating cell-free nucleic acid fragments.

E44. The method of any one of embodiments E1 to E43, wherein the overhangs in target nucleic acids are native overhangs.

E45. The method of any one of embodiments E1 to E44, wherein the overhangs in target nucleic acids are unmodified overhangs.

E46. The method of any one of embodiments E1 to E45, wherein the target nucleic acids are not modified in length prior to combining with the plurality of oligonucleotide species.

E47. The method of any one of embodiments E1 to E46, comprising preparing the nucleic acid composition prior to (a), by a process consisting essentially of isolating nucleic acid from a sample, thereby generating the nucleic acid composition.

E48. The method of any one of embodiments E1 to E47, comprising exposing the hybridization products to conditions under which an end of the target nucleic acid is joined to an end of the oligonucleotide to which it is hybridized.

E49. The method of embodiment E48, comprising contacting the hybridization products with an agent comprising a ligase activity under conditions in which an end of a target nucleic acid is covalently linked to an end of the oligonucleotide to which the target nucleic acid is hybridized.

E50. The method of any one of embodiments E1 to E49, comprising prior to (a), contacting the target nucleic acid composition with an agent comprising a phosphatase activity under conditions in which target nucleic acids are dephosphorylated, thereby generating a dephosphorylated target nucleic acid composition.

E51. The method of embodiment E50, comprising prior to (a), contacting the dephosphorylated target nucleic acid composition with an agent comprising a phosphoryl transfer activity under conditions in which a 5′ phosphate is added to a 5′ end of target nucleic acids.

E52. The method of any one of embodiments E1 to E51, comprising prior to (a), contacting the plurality of oligonucleotide species with an agent comprising a phosphatase activity under conditions in which the oligonucleotides are dephosphorylated, thereby generating a plurality of dephosphorylated oligonucleotide species.

E53. The method of embodiment E52, comprising prior to (a), contacting the dephosphorylated oligonucleotide species with an agent comprising a phosphoryl transfer activity under conditions in which a 5′ phosphate is added to a 5′ end of oligonucleotide species.

E54. The method of any one of embodiments E1 to E53, wherein the target nucleic acids are obtained from a sample from a subject.

E55. The method of embodiment E54, wherein the subject is a human.

E56. The method of any one of embodiments E1 to E55, comprising prior to (a), separating the target nucleic acids according to fragment length.

E57. The method of embodiment E56, wherein target nucleic acids having fragment lengths of less than about 500 bp are combined with the plurality of oligonucleotide species.

E58. The method of embodiment E56, wherein target nucleic acids having fragment lengths of about 500 bp or more are combined with the plurality of oligonucleotide species.

E59. The method of any one of embodiments E1 to E58, wherein the oligonucleotide overhang comprises DNA nucleotides.

E60. The method of any one of embodiments E1 to E58, wherein the oligonucleotide overhang consists of DNA nucleotides.

E61. The method of any one of embodiments E1 to E58, wherein the oligonucleotide overhang comprises RNA nucleotides.

E62. The method of any one of embodiments E1 to E58, wherein the oligonucleotide overhang consists of RNA nucleotides.

E63. The method of embodiment E61 or E62, comprising contacting the hybridization products with an agent comprising an RNA ligase activity under conditions in which an end of a target nucleic acid is covalently linked to an end of the oligonucleotide to which the target nucleic acid is hybridized.

E64. The method of any one of embodiments E61 to E63, comprising contacting the hybridization products with an agent comprising an RNase activity under conditions in which double-stranded RNA duplexes are digested.
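
By way of non-limiting illustration, the separation according to fragment length recited in embodiments E56 to E58 may be modeled computationally as a partition of measured fragment lengths around a cutoff. The sketch below is an assumption for illustration only: the function name, the example lengths, and the treatment of "about 500 bp" as an exact cutoff of 500 are hypothetical and do not describe any particular laboratory size-selection procedure.

```python
def split_by_fragment_length(fragment_lengths_bp, cutoff_bp=500):
    """Partition fragment lengths into (shorter than cutoff, cutoff or longer).

    The exact cutoff is an assumption; the embodiments recite "about 500 bp".
    """
    shorter = [n for n in fragment_lengths_bp if n < cutoff_bp]
    longer = [n for n in fragment_lengths_bp if n >= cutoff_bp]
    return shorter, longer


# Hypothetical fragment lengths in base pairs, for illustration only.
example_lengths = [167, 320, 499, 500, 1200]
shorter, longer = split_by_fragment_length(example_lengths)
print(shorter)  # [167, 320, 499]
print(longer)   # [500, 1200]
```

Either partition may then correspond to the fraction that is combined with the plurality of oligonucleotide species, as in embodiment E57 or E58.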

F1. A composition comprising a plurality of oligonucleotide species, wherein:

-   (a) some or all of the oligonucleotides in the plurality of oligonucleotide species comprise two strands and an overhang at a first end and one or more modified nucleotides at a second end, wherein the overhang is capable of hybridizing to a target nucleic acid overhang, wherein each oligonucleotide species has a unique overhang sequence and length; and
-   (b) each oligonucleotide in the plurality of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the oligonucleotide overhang.

F2. The composition of embodiment F1, wherein the oligonucleotides having one or more modified nucleotides at a second end comprise an unpaired modified nucleotide at the second end.

F3. The composition of embodiment F1 or F2, wherein the oligonucleotides having one or more modified nucleotides at a second end comprise the one or more modified nucleotides at the end of the strand having a 3′ terminus.

F4. The composition of embodiment F1 or F2, wherein the oligonucleotides having one or more modified nucleotides at a second end comprise the one or more modified nucleotides at the end of the strand having a 5′ terminus.

F5. The composition of any one of embodiments F1 to F4, wherein the one or more modified nucleotides are capable of blocking hybridization to a nucleotide in a target nucleic acid.

F6. The composition of any one of embodiments F1 to F5, wherein the one or more modified nucleotides are capable of blocking ligation to a nucleotide in a target nucleic acid.

F7. The composition of any one of embodiments F1 to F6, wherein the one or more modified nucleotides comprise a modified nucleotide incapable of binding to a natural nucleotide.

F8. The composition of any one of embodiments F1 to F7, wherein the one or more modified nucleotides comprise one or more modified nucleotides chosen from an isodeoxy-base, a dideoxy-base, an inverted dideoxy-base, a spacer, and an amino linker.

F9. The composition of any one of embodiments F1 to F8, wherein the one or more modified nucleotides comprise an isodeoxy-base.

F10. The composition of embodiment F9, wherein the one or more modified nucleotides comprise isodeoxy-guanine (iso-dG).

F11. The composition of embodiment F9, wherein the one or more modified nucleotides comprise isodeoxy-cytosine (iso-dC).

F12. The composition of any one of embodiments F1 to F8, wherein the one or more modified nucleotides comprise a dideoxy-base.

F13. The composition of embodiment F12, wherein the one or more modified nucleotides comprise dideoxy-cytosine.

F14. The composition of any one of embodiments F1 to F8, wherein the one or more modified nucleotides comprise an inverted dideoxy-base.

F15. The composition of embodiment F14, wherein the one or more modified nucleotides comprise inverted dideoxy-thymine.

F16. The composition of any one of embodiments F1 to F8, wherein the one or more modified nucleotides comprise a spacer.

F17. The composition of embodiment F16, wherein the one or more modified nucleotides comprise a C3 spacer.

F18. The composition of any one of embodiments F1 to F17, wherein the plurality of oligonucleotide species comprises oligonucleotides having a 5′ overhang and oligonucleotides having a 3′ overhang.

F19. The composition of any one of embodiments F1 to F18, wherein the plurality of oligonucleotide species comprises oligonucleotides having a 5′ overhang, oligonucleotides having a 3′ overhang, and oligonucleotides having no overhang.

F20. The composition of any one of embodiments F1 to F19, wherein the oligonucleotides that comprise an overhang comprise a duplex portion, an overhang at the first end and at least one unpaired modified nucleotide at the second end.

F21. The composition of any one of embodiments F1 to F20, wherein the oligonucleotide overhang comprises 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 or 20 nucleotides.

F22. The composition of any one of embodiments F1 to F21, wherein the oligonucleotides in the plurality of oligonucleotide species comprise oligonucleotide overhangs having different sequences for a particular overhang length.

F23. The composition of embodiment F22, wherein the oligonucleotides in the plurality of oligonucleotide species comprise all possible overhang sequence combinations for a particular overhang length.

F24. The composition of embodiment F23, wherein the oligonucleotides in the plurality of oligonucleotide species comprise all possible overhang sequence combinations for each overhang length.

F25. The composition of embodiment F22, F23 or F24, wherein the oligonucleotide overhang sequences are random.

F26. The composition of any one of embodiments F1 to F25, wherein the oligonucleotides that comprise no overhang comprise a duplex portion having a blunt end at a first end and at least one unpaired modified nucleotide at a second end.

F27. The composition of any one of embodiments F1 to F26, wherein the oligonucleotide overhang identification sequence is specific to length of the oligonucleotide overhang.

F28. The composition of any one of embodiments F1 to F27, wherein the oligonucleotide overhang identification sequence is specific to length of the oligonucleotide overhang and is specific to one or more features of the oligonucleotide overhang chosen from (i) a 5′ overhang, (ii) a 3′ overhang, (iii) a particular sequence, (iv) a combination of (i) and (iii), or (v) a combination of (ii) and (iii).

F29. The composition of any one of embodiments F1 to F28, wherein an oligonucleotide species comprises no overhang and comprises an oligonucleotide overhang identification sequence specific to having no overhang.

F30. The composition of any one of embodiments F1 to F29, wherein the oligonucleotide overhang comprises DNA nucleotides.

F31. The composition of any one of embodiments F1 to F29, wherein the oligonucleotide overhang consists of DNA nucleotides.

F32. The composition of any one of embodiments F1 to F29, wherein the oligonucleotide overhang comprises RNA nucleotides.

F33. The composition of any one of embodiments F1 to F29, wherein the oligonucleotide overhang consists of RNA nucleotides.
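
By way of non-limiting illustration, the phrase "all possible overhang sequence combinations" in embodiments F23 and F24 can be made concrete by enumeration: an overhang of n DNA nucleotides has 4^n possible sequences. The sketch below assumes a four-letter DNA alphabet and uses hypothetical function names; it shows the size of the sequence space and is not a description of how any oligonucleotide pool is synthesized.

```python
from itertools import product

DNA_ALPHABET = "ACGT"  # assumption: unmodified DNA overhangs only


def all_overhangs(length):
    """Return every DNA sequence of the given length (4**length sequences)."""
    return ["".join(bases) for bases in product(DNA_ALPHABET, repeat=length)]


def overhang_pool(max_length):
    """Map each overhang length from 1 to max_length to its full sequence set."""
    return {n: all_overhangs(n) for n in range(1, max_length + 1)}


pool = overhang_pool(3)
print({n: len(seqs) for n, seqs in pool.items()})  # {1: 4, 2: 16, 3: 64}
```

Because the count grows as 4^n (a 20-nucleotide overhang has 4^20, or about 1.1 x 10^12, possible sequences), explicit enumeration of this kind is practical only for short overhang lengths.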

G1. A method for modifying nucleic acid ends, comprising:

-   (a) combining a nucleic acid composition comprising target nucleic acids and a plurality of oligonucleotide species, wherein:
    -   (i) the oligonucleotides in the plurality of oligonucleotide species comprise two strands and an overhang at a first end, wherein the first end overhang comprises a palindromic sequence,
    -   (ii) some or all of the oligonucleotides in the plurality of oligonucleotide species comprise an overhang at a second end, wherein the second end overhang is capable of hybridizing to a target nucleic acid overhang, wherein each oligonucleotide species has a unique second end overhang sequence and length,
    -   (iii) some or all of the target nucleic acids comprise an overhang,
    -   (iv) each oligonucleotide in the plurality of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the second end overhang,
    -   (v) each oligonucleotide in the plurality of oligonucleotide species comprises one or more modified nucleotides, and
    -   (vi) the nucleic acid composition and the plurality of oligonucleotide species are combined under conditions in which first end overhangs hybridize to other first end overhangs and second end overhangs hybridize to target nucleic acid overhangs having a corresponding length, thereby forming circular hybridization products;
-   (b) contacting the hybridization products with an exonuclease, thereby generating exonuclease-treated hybridization products;
-   (c) shearing the exonuclease-treated hybridization products, thereby generating sheared exonuclease-treated hybridization products; and
-   (d) separating fragments comprising a sequence in the oligonucleotide species from fragments not comprising a sequence in the oligonucleotide species, thereby generating separated, sheared, exonuclease-treated hybridization products.

G2. The method of embodiment G1, wherein the one or more modified nucleotides comprise a nucleotide conjugated to a first member of a binding pair.

G3. The method of embodiment G1 or G2, wherein the one or more modified nucleotides comprise a nucleotide conjugated to biotin.

G4. The method of any one of embodiments G1 to G3, wherein the first end overhang comprises the one or more modified nucleotides.

G5. The method of any one of embodiments G1 to G4, wherein the separating in (d) comprises contacting the sheared exonuclease-treated hybridization products with a second member of a binding pair.

G6. The method of embodiment G5, wherein the second member of a binding pair is streptavidin conjugated to a solid support.

G7. The method of any one of embodiments G1 to G6, wherein the plurality of oligonucleotide species comprises oligonucleotides having a 5′ overhang and oligonucleotides having a 3′ overhang, and combinations thereof.

G8. The method of any one of embodiments G1 to G7, wherein one or more oligonucleotide species have a 5′ overhang at a first end.

G9. The method of any one of embodiments G1 to G8, wherein one or more oligonucleotide species have a 3′ overhang at a first end.

G10. The method of any one of embodiments G1 to G9, wherein one or more oligonucleotide species have a 5′ overhang at a second end.

G11. The method of any one of embodiments G1 to G10, wherein one or more oligonucleotide species have a 3′ overhang at a second end.

G12. The method of any one of embodiments G1 to G11, wherein one or more oligonucleotide species have no overhang at a second end.

G13. The method of any one of embodiments G1 to G12, wherein the plurality of oligonucleotide species comprises oligonucleotides independently having a first end 5′ overhang or a first end 3′ overhang, and a second end 5′ overhang, a second end 3′ overhang, or a second end comprising no overhang.

G14. The method of any one of embodiments G1 to G13, wherein the second end overhang comprises 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 or 20 nucleotides.

G15. The method of any one of embodiments G1 to G14, wherein the oligonucleotides in the plurality of oligonucleotide species comprise second end overhangs having different sequences for a particular overhang length.

G16. The method of embodiment G15, wherein the second end overhangs in the plurality of oligonucleotide species comprise all possible overhang sequence combinations for a particular overhang length.

G17. The method of embodiment G16, wherein the second end overhangs in the plurality of oligonucleotide species comprise all possible overhang sequence combinations for each overhang length.

G18. The method of embodiment G15, G16 or G17, wherein the second end overhang sequences are random.

G19. The method of any one of embodiments G1 to G18, wherein an end of an oligonucleotide is capable of being covalently linked to an end of a target nucleic acid to which the oligonucleotide is hybridized in the hybridization products.

G20. The method of any one of embodiments G1 to G19, wherein an end of a first end overhang is capable of being covalently linked to an end of an oligonucleotide species comprising a first end to which the first end overhang is hybridized in the hybridization products.

G21. The method of embodiment G20, wherein the 3′ end of an oligonucleotide strand is capable of being covalently linked to the 5′ end of a strand in the target nucleic acid to which the oligonucleotide is hybridized in a hybridization product.

G22. The method of any one of embodiments G1 to G21, wherein the oligonucleotide overhang identification sequence is specific to length of the second end overhang.

G23. The method of any one of embodiments G1 to G22, wherein the oligonucleotide overhang identification sequence is specific to length of the second end overhang and is specific to one or more features of the second end overhang chosen from (i) a 5′ overhang, (ii) a 3′ overhang, (iii) a particular sequence, (iv) a combination of (i) and (iii), or (v) a combination of (ii) and (iii).

G24. The method of any one of embodiments G1 to G23, wherein some of the target nucleic acids comprise no overhang.

G25. The method of any one of embodiments G1 to G24, wherein an oligonucleotide species comprises no second end overhang and comprises an oligonucleotide overhang identification sequence specific to having no overhang.

G26. The method of any one of embodiments G1 to G25, wherein the target nucleic acids comprising an overhang comprise a duplex region and a single-stranded overhang.

G27. The method of any one of embodiments G1 to G26, wherein each target nucleic acid comprising an overhang comprises an overhang at one end or an overhang at both ends.

G28. The method of any one of embodiments G1 to G27, wherein an end, or both ends, of each target nucleic acid comprising an overhang independently comprises a 5′ overhang or a 3′ overhang.

G29. The method of any one of embodiments G1 to G28, wherein the target nucleic acids comprise deoxyribonucleic acid (DNA) fragments.

G30. The method of embodiment G29, wherein the DNA fragments are obtained from cells.

G31. The method of embodiment G29 or G30, wherein the DNA fragments comprise genomic DNA fragments.

G32. The method of any one of embodiments G1 to G28, wherein the target nucleic acids comprise ribonucleic acid (RNA) fragments.

G33. The method of embodiment G32, wherein the RNA fragments are obtained from cells.

G34. The method of any one of embodiments G1 to G33, wherein the target nucleic acids comprise cell-free nucleic acid fragments.

G35. The method of any one of embodiments G1 to G34, wherein the target nucleic acids comprise circulating cell-free nucleic acid fragments.

G36. The method of any one of embodiments G1 to G35, wherein the overhangs in target nucleic acids are native overhangs.

G37. The method of any one of embodiments G1 to G36, wherein the overhangs in target nucleic acids are unmodified overhangs.

G38. The method of any one of embodiments G1 to G37, wherein the target nucleic acids are not modified in length prior to combining with the plurality of oligonucleotide species.

G39. The method of any one of embodiments G1 to G38, comprising preparing the nucleic acid composition prior to (a), by a process consisting essentially of isolating nucleic acid from a sample, thereby generating the nucleic acid composition.

G40. The method of any one of embodiments G1 to G39, comprising exposing the hybridization products to conditions under which an end of the target nucleic acid is joined to an end of the oligonucleotide to which it is hybridized.

G41. The method of embodiment G40, comprising contacting the hybridization products with an agent comprising a ligase activity under conditions in which an end of a target nucleic acid is covalently linked to an end of the oligonucleotide to which the target nucleic acid is hybridized.

G42. The method of any one of embodiments G1 to G41, comprising prior to (a), contacting the target nucleic acid composition with an agent comprising a phosphatase activity under conditions in which target nucleic acids are dephosphorylated, thereby generating a dephosphorylated target nucleic acid composition.

G43. The method of embodiment G42, comprising prior to (a), contacting the dephosphorylated target nucleic acid composition with an agent comprising a phosphoryl transfer activity under conditions in which a 5′ phosphate is added to a 5′ end of target nucleic acids.

G44. The method of any one of embodiments G1 to G43, comprising prior to (a), contacting the plurality of oligonucleotide species with an agent comprising a phosphatase activity under conditions in which the oligonucleotides are dephosphorylated, thereby generating a plurality of dephosphorylated oligonucleotide species.

G45. The method of embodiment G44, comprising prior to (a), contacting the dephosphorylated oligonucleotide species with an agent comprising a phosphoryl transfer activity under conditions in which a 5′ phosphate is added to a 5′ end of oligonucleotide species.

G46. The method of any one of embodiments G1 to G45, wherein the target nucleic acids are obtained from a sample from a subject.

G47. The method of embodiment G46, wherein the subject is a human.

G48. The method of any one of embodiments G1 to G47, comprising prior to (a), separating the target nucleic acids according to fragment length.

G49. The method of embodiment G48, wherein target nucleic acids having fragment lengths of less than about 500 bp are combined with the plurality of oligonucleotide species.

G50. The method of embodiment G48, wherein target nucleic acids having fragment lengths of about 500 bp or more are combined with the plurality of oligonucleotide species.

G51. The method of any one of embodiments G1 to G50, wherein the second end overhang comprises DNA nucleotides.

G52. The method of any one of embodiments G1 to G50, wherein the second end overhang consists of DNA nucleotides.

G53. The method of any one of embodiments G1 to G50, wherein the second end overhang comprises RNA nucleotides.

G54. The method of any one of embodiments G1 to G50, wherein the second end overhang consists of RNA nucleotides.

G55. The method of embodiment G53 or G54, comprising contacting the hybridization products with an agent comprising an RNA ligase activity under conditions in which an end of a target nucleic acid is covalently linked to an end of the oligonucleotide to which the target nucleic acid is hybridized.

G56. The method of any one of embodiments G53 to G55, comprising contacting the hybridization products with an agent comprising an RNase activity under conditions in which double-stranded RNA duplexes are digested.
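
By way of non-limiting illustration, the palindromic first end overhang recited in embodiment G1 can be understood as a sequence that equals its own reverse complement, which is why one first end overhang can hybridize to another first end overhang having the same sequence. The sketch below assumes DNA overhangs written 5′ to 3′ and uses hypothetical function names and example sequences.

```python
# Translation table mapping each DNA base to its Watson-Crick complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_palindromic(seq):
    """True if the sequence equals its own reverse complement."""
    return seq.upper() == reverse_complement(seq)


# Hypothetical example overhang sequences, for illustration only.
print(is_palindromic("GAATTC"))  # True
print(is_palindromic("GATTC"))   # False
```

Under this definition a palindromic DNA overhang necessarily has an even number of nucleotides, because the central base of an odd-length sequence would have to be its own complement.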

H1. A composition comprising a plurality of oligonucleotide species, wherein:

-   (a) the oligonucleotides in the plurality of oligonucleotide species comprise two strands and an overhang at a first end, wherein the first end overhang comprises a palindromic sequence;
-   (b) some or all of the oligonucleotides in the plurality of oligonucleotide species comprise an overhang at a second end, wherein the second end overhang is capable of hybridizing to a target nucleic acid overhang, wherein each oligonucleotide species has a unique second end overhang sequence and length;
-   (c) each oligonucleotide in the plurality of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the second end overhang; and
-   (d) each oligonucleotide in the plurality of oligonucleotide species comprises one or more modified nucleotides.

H2. The composition of embodiment H1, wherein the one or more modified nucleotides comprise a nucleotide conjugated to a first member of a binding pair.

H3. The composition of embodiment H1 or H2, wherein the one or more modified nucleotides comprise a nucleotide conjugated to biotin.

H4. The composition of any one of embodiments H1 to H3, wherein the first end overhang comprises the one or more modified nucleotides.

H5. The composition of any one of embodiments H1 to H4, wherein the separating in (d) comprises contacting the sheared exonuclease-treated hybridization products with a second member of a binding pair.

H6. The composition of embodiment H5, wherein the second member of a binding pair is streptavidin conjugated to a solid support.

H7. The composition of any one of embodiments H1 to H6, wherein the plurality of oligonucleotide species comprises oligonucleotides having a 5′ overhang and oligonucleotides having a 3′ overhang, and combinations thereof.

H8. The composition of any one of embodiments H1 to H7, wherein one or more oligonucleotide species have a 5′ overhang at a first end.

H9. The composition of any one of embodiments H1 to H8, wherein one or more oligonucleotide species have a 3′ overhang at a first end.

H10. The composition of any one of embodiments H1 to H9, wherein one or more oligonucleotide species have a 5′ overhang at a second end.

H11. The composition of any one of embodiments H1 to H10, wherein one or more oligonucleotide species have a 3′ overhang at a second end.

H12. The composition of any one of embodiments H1 to H11, wherein one or more oligonucleotide species have no overhang at a second end.

H13. The composition of any one of embodiments H1 to H12, wherein the plurality of oligonucleotide species comprises oligonucleotides independently having a first end 5′ overhang or a first end 3′ overhang, and a second end 5′ overhang, a second end 3′ overhang, or a second end comprising no overhang.

H14. The composition of any one of embodiments H1 to H13, wherein the second end overhang comprises 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 or 20 nucleotides.

H15. The composition of any one of embodiments H1 to H14, wherein the oligonucleotides in the plurality of oligonucleotide species comprise second end overhangs having different sequences for a particular overhang length.

H16. The composition of embodiment H15, wherein the second end overhangs in the plurality of oligonucleotide species comprise all possible overhang sequence combinations for a particular overhang length.

H17. The composition of embodiment H16, wherein the second end overhangs in the plurality of oligonucleotide species comprise all possible overhang sequence combinations for each overhang length.

H18. The composition of embodiment H15, H16 or H17, wherein the second end overhang sequences are random.

H19. The composition of any one of embodiments H1 to H18, wherein the oligonucleotide overhang identification sequence is specific to length of the second end overhang.

H20. The composition of any one of embodiments H1 to H19, wherein the oligonucleotide overhang identification sequence is specific to length of the second end overhang and is specific to one or more features of the second end overhang chosen from (i) a 5′ overhang, (ii) a 3′ overhang, (iii) a particular sequence, (iv) a combination of (i) and (iii), or (v) a combination of (ii) and (iii).

H21. The composition of any one of embodiments H1 to H20, wherein an oligonucleotide species comprises no second end overhang and comprises an oligonucleotide overhang identification sequence specific to having no overhang.

H22. The composition of any one of embodiments H1 to H21, wherein the second end overhang comprises DNA nucleotides.

H23. The composition of any one of embodiments H1 to H21, wherein the second end overhang consists of DNA nucleotides.

H24. The composition of any one of embodiments H1 to H21, wherein the second end overhang comprises RNA nucleotides.

H25. The composition of any one of embodiments H1 to H21, wherein the second end overhang consists of RNA nucleotides.
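
By way of non-limiting illustration, an oligonucleotide overhang identification sequence that is specific to overhang length and overhang type, as in embodiments H19 to H21, may be treated in downstream analysis as a key into a lookup table. The sketch below is hypothetical throughout: the identification sequences, the type labels ("5p", "3p", "blunt"), and the function name are assumptions chosen for illustration and are not sequences disclosed herein.

```python
from typing import NamedTuple, Optional


class OverhangFeatures(NamedTuple):
    kind: str    # "5p" (5' overhang), "3p" (3' overhang), or "blunt"
    length: int  # overhang length in nucleotides (0 for no overhang)


# Hypothetical identification sequences, invented for this example.
ID_TABLE = {
    "ACGT": OverhangFeatures("5p", 1),
    "AGCT": OverhangFeatures("5p", 2),
    "TCGA": OverhangFeatures("3p", 1),
    "TGCA": OverhangFeatures("3p", 2),
    "GGCC": OverhangFeatures("blunt", 0),
}


def decode_identification_sequence(id_sequence: str) -> Optional[OverhangFeatures]:
    """Look up overhang features for an identification sequence, or None if unknown."""
    return ID_TABLE.get(id_sequence.upper())


print(decode_identification_sequence("agct"))  # OverhangFeatures(kind='5p', length=2)
print(decode_identification_sequence("TTTT"))  # None
```

A one-to-one table of this kind reflects the requirement that each identification sequence be specific to the features it reports, including an entry for oligonucleotide species having no second end overhang.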

I1. A method for modifying nucleic acid ends, comprising:

-   (a) combining a nucleic acid composition comprising target nucleic acids and a plurality of oligonucleotide species, wherein:
    -   (i) some or all of the oligonucleotides in the plurality of oligonucleotide species comprise (1) two strands and an overhang at a first end and two non-complementary strands at a second end, or (2) one strand capable of forming a hairpin structure having a single-stranded loop and an overhang; wherein the overhang is capable of hybridizing to a target nucleic acid overhang, wherein each oligonucleotide species has a unique overhang sequence and length,
    -   (ii) some or all of the target nucleic acids comprise an overhang,
    -   (iii) each oligonucleotide in the plurality of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the oligonucleotide overhang, and
    -   (iv) the nucleic acid composition and the plurality of oligonucleotide species are combined under conditions in which oligonucleotide overhangs hybridize to target nucleic acid overhangs having a corresponding length, thereby forming hybridization products; and
-   (b) contacting the hybridization products with a strand-displacing polymerase, thereby forming blunt-ended nucleic acid fragments.

I2. The method of embodiment I1, wherein each oligonucleotide in the plurality of oligonucleotide species consists of one strand capable of forming a hairpin structure having a single-stranded loop.

I3. The method of embodiment I1 or I2, wherein the single-stranded loop comprises a cleavage site.

I4. The method of embodiment I3, wherein the cleavage site comprises one or more RNA nucleotides.

I5. The method of embodiment I4, wherein the loop comprises two RNA nucleotides.

I6. The method of embodiment I4, wherein the loop comprises three RNA nucleotides.

I7. The method of embodiment I4, wherein the loop comprises four RNA nucleotides.

I8. The method of any one of embodiments I4 to I7, wherein the loop comprises one or more ribonucleic acid (RNA) nucleotides chosen from adenine (A), cytosine (C), guanine (G), and uracil (U).

I9. The method of any one of embodiments I4 to I8, wherein the RNA nucleotides comprise guanine (G).

I10. The method of any one of embodiments I4 to I8, wherein the RNA nucleotides consist of guanine (G).

I11. The method of any one of embodiments I4 to I8, wherein the RNA nucleotides comprise cytosine (C).

I12. The method of any one of embodiments I4 to I8, wherein the RNA nucleotides consist of cytosine (C).

I13. The method of any one of embodiments I4 to I8, wherein the RNA nucleotides comprise adenine (A).

I14. The method of any one of embodiments I4 to I8, wherein the RNA nucleotides consist of adenine (A).

I15. The method of any one of embodiments I4 to I8, wherein the RNA nucleotides consist of adenine (A), cytosine (C), and guanine (G).

I16. The method of any one of embodiments I4 to I8, wherein the RNA nucleotides consist of adenine (A) and cytosine (C).

I17. The method of any one of embodiments I4 to I8, wherein the RNA nucleotides consist of adenine (A) and guanine (G).

I18. The method of any one of embodiments I4 to I8, wherein the RNA nucleotides consist of cytosine (C) and guanine (G).

I19. The method of any one of embodiments I1 to I18, wherein the plurality of oligonucleotide species comprises oligonucleotides having a 5′ overhang and oligonucleotides having a 3′ overhang.

I20. The method of any one of embodiments I1 to I19, wherein the plurality of oligonucleotide species comprises oligonucleotides having a 5′ overhang, oligonucleotides having a 3′ overhang, and oligonucleotides having no overhang.

I21. The method of any one of embodiments I1 to I20, wherein the oligonucleotides that comprise an overhang comprise a single-stranded loop, a duplex portion, and a single-stranded overhang.

I22. The method of any one of embodiments I1 to I20, wherein theoligonucleotides that comprise an overhang comprise an overhang at afirst end, a duplex portion, and two non-complementary strands at asecond end.

I23. The method of any one of embodiments I1 to I22, wherein theoligonucleotide overhang comprises 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12, 13, 14, 15, 16, 17, 18, 19 or 20 nucleotides.

I24. The method of any one of embodiments I1 to I23, wherein theoligonucleotides in the plurality of oligonucleotide species compriseoligonucleotide overhangs having different sequences for a particularoverhang length.

I25. The method of embodiment I24, wherein the oligonucleotides in theplurality of oligonucleotide species comprise all possible overhangsequence combinations for a particular overhang length.

I26. The method of embodiment I25, wherein the oligonucleotides in theplurality of oligonucleotide species comprise all possible overhangsequence combinations for each overhang length.
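
As a non-limiting illustration, the set of all possible overhang sequence combinations for each overhang length could be enumerated in software as sketched below. The function names and the four-letter DNA alphabet are assumptions made for this example only; an overhang of n nucleotides drawn from four bases yields 4^n sequences.

```python
from itertools import product

DNA_BASES = "ACGT"  # assumed alphabet; "ACGU" could be used for RNA overhangs


def all_overhangs(length, alphabet=DNA_BASES):
    """Return every possible overhang sequence of the given length."""
    return ["".join(bases) for bases in product(alphabet, repeat=length)]


def overhang_pool(max_length, alphabet=DNA_BASES):
    """Map each overhang length from 1 to max_length to its full sequence set."""
    return {n: all_overhangs(n, alphabet) for n in range(1, max_length + 1)}


pool = overhang_pool(3)
# Number of sequences per overhang length: 4, 16 and 64 for lengths 1, 2 and 3.
print({n: len(seqs) for n, seqs in pool.items()})
```

The pool grows exponentially with overhang length, so a practical design might enumerate only the shorter lengths and rely on random synthesis for longer overhangs.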

I27. The method of embodiment I24, I25 or I26, wherein the oligonucleotide overhang sequences are random.

I28. The method of any one of embodiments I1 to I27, wherein the oligonucleotides that comprise no overhang comprise a single-stranded loop and a duplex portion.

I29. The method of any one of embodiments I1 to I27, wherein the oligonucleotides that comprise no overhang comprise a blunt-ended first end, a duplex portion, and two non-complementary strands at a second end.

I30. The method of any one of embodiments I1 to I29, wherein an end of an oligonucleotide is capable of being covalently linked to an end of a target nucleic acid to which the oligonucleotide is hybridized in the hybridization products.

I31. The method of embodiment I30, wherein the 3′ end of an oligonucleotide strand is capable of being covalently linked to the 5′ end of a strand in the target nucleic acid to which the oligonucleotide is hybridized in a hybridization product.

I32. The method of any one of embodiments I1 to I31, wherein the hybridization products comprise a duplex region and at least one single-stranded loop.

I33. The method of any one of embodiments I1 to I32, wherein the hybridization products comprise a duplex region and a single-stranded loop at each end.

I34. The method of any one of embodiments I1 to I31, wherein the hybridization products comprise a duplex region and at least one end comprising two non-complementary strands.

I35. The method of any one of embodiments I1 to I31 and I34, wherein the hybridization products comprise a duplex region and two non-complementary strands at each end.

I36. The method of any one of embodiments I3 to I35, comprising contacting the hybridization products under cleavage conditions with one or more cleavage agents capable of cleaving the hybridization products within the hairpin loop at the cleavage site, thereby forming cleaved hybridization products.

I37. The method of any one of embodiments I4 to I36, comprising contacting the hybridization products under cleavage conditions with one or more cleavage agents capable of cleaving the hybridization products within the hairpin loop at the RNA nucleotide(s), thereby forming cleaved hybridization products.

I38. The method of embodiment I37, wherein the one or more cleavage agents are capable of cleaving the hybridization products within the hairpin loop at the RNA nucleotide(s) and are not capable of cleaving the hybridization products within the duplex region.

I39. The method of any one of embodiments I36 to I38, wherein the one or more cleavage agents comprise a ribonuclease (RNAse).

I40. The method of embodiment I39, wherein the RNAse is an endoribonuclease.

I41. The method of embodiment I39 or I40, wherein the RNAse is chosen from one or more of RNAse A, RNAse E, RNAse F, RNAse H, RNAse III, RNAse L, RNAse P, RNAse PhyM, RNAse T1, RNAse T2, RNAse U2, and RNAse V.

I42. The method of any one of embodiments I1 to I41, wherein the oligonucleotide overhang identification sequence is specific to length of the oligonucleotide overhang.

I43. The method of any one of embodiments I1 to I42, wherein the oligonucleotide overhang identification sequence is specific to length of the oligonucleotide overhang and is specific to one or more features of the oligonucleotide overhang chosen from (i) a 5′ overhang, (ii) a 3′ overhang, (iii) a particular sequence, (iv) a combination of (i) and (iii), or (v) a combination of (ii) and (iii).
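
By way of a hedged example, an overhang identification sequence that is specific to overhang length and overhang type could be decoded with a simple lookup table. The three-nucleotide identification sequences, the `OverhangFeatures` record and the `decode` helper below are hypothetical and are not taken from any embodiment; they only show how such a sequence can stand in for the features of the oligonucleotide overhang.

```python
from typing import NamedTuple, Optional


class OverhangFeatures(NamedTuple):
    length: int          # overhang length in nucleotides (0 for no overhang)
    kind: Optional[str]  # "5prime", "3prime", or None for no overhang


# Hypothetical identification sequences chosen only for illustration.
ID_SEQUENCES = {
    "AAC": OverhangFeatures(0, None),
    "ACA": OverhangFeatures(1, "5prime"),
    "AGT": OverhangFeatures(1, "3prime"),
    "CAG": OverhangFeatures(2, "5prime"),
    "CTC": OverhangFeatures(2, "3prime"),
}


def decode(id_sequence):
    """Return the features encoded by an identification sequence, or None if unknown."""
    return ID_SEQUENCES.get(id_sequence.upper())


print(decode("CAG"))
```

An identification sequence that is absent from the table returns `None`, which lets a downstream step discard reads whose identification sequence cannot be resolved.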

I44. The method of any one of embodiments I1 to I43, wherein some of the target nucleic acids comprise no overhang.

I45. The method of any one of embodiments I1 to I44, wherein an oligonucleotide species comprises no overhang and comprises an oligonucleotide overhang identification sequence specific to having no overhang.

I46. The method of any one of embodiments I1 to I45, wherein the target nucleic acids comprising an overhang comprise a duplex region and a single-stranded overhang.

I47. The method of any one of embodiments I1 to I46, wherein each target nucleic acid comprising an overhang comprises an overhang at one end or an overhang at both ends.

I48. The method of any one of embodiments I1 to I47, wherein an end, or both ends, of each target nucleic acid comprising an overhang independently comprises a 5′ overhang or a 3′ overhang.

I49. The method of any one of embodiments I1 to I48, wherein the target nucleic acids comprise deoxyribonucleic acid (DNA) fragments.

I50. The method of embodiment I49, wherein the DNA fragments are obtained from cells.

I51. The method of embodiment I49 or I50, wherein the DNA fragments comprise genomic DNA fragments.

I52. The method of any one of embodiments I1 to I48, wherein the target nucleic acids comprise ribonucleic acid (RNA) fragments.

I53. The method of embodiment I52, wherein the RNA fragments are obtained from cells.

I54. The method of any one of embodiments I1 to I53, wherein the target nucleic acids comprise cell-free nucleic acid fragments.

I55. The method of any one of embodiments I1 to I54, wherein the target nucleic acids comprise circulating cell-free nucleic acid fragments.

I56. The method of any one of embodiments I1 to I55, wherein the overhangs in target nucleic acids are native overhangs.

I57. The method of any one of embodiments I1 to I56, wherein the overhangs in target nucleic acids are unmodified overhangs.

I58. The method of any one of embodiments I1 to I57, wherein the target nucleic acids are not modified in length prior to combining with the plurality of oligonucleotide species.

I59. The method of any one of embodiments I1 to I58, comprising preparing the nucleic acid composition prior to (a), by a process consisting essentially of isolating nucleic acid from a sample, thereby generating the nucleic acid composition.

I60. The method of any one of embodiments I1 to I59, comprising exposing the hybridization products to conditions under which an end of the target nucleic acid is joined to an end of the oligonucleotide to which it is hybridized.

I61. The method of embodiment I60, comprising contacting the hybridization products with an agent comprising a ligase activity under conditions in which an end of a target nucleic acid is covalently linked to an end of the oligonucleotide to which the target nucleic acid is hybridized.

I62. The method of any one of embodiments I1 to I61, comprising prior to (a), contacting the target nucleic acid composition with an agent comprising a phosphatase activity under conditions in which target nucleic acids are dephosphorylated, thereby generating a dephosphorylated target nucleic acid composition.

I63. The method of embodiment I62, comprising prior to (a), contacting the dephosphorylated target nucleic acid composition with an agent comprising a phosphoryl transfer activity under conditions in which a 5′ phosphate is added to a 5′ end of target nucleic acids.

I64. The method of any one of embodiments I1 to I63, comprising prior to (a), contacting the plurality of oligonucleotide species with an agent comprising a phosphatase activity under conditions in which the oligonucleotides are dephosphorylated, thereby generating a plurality of dephosphorylated oligonucleotide species.

I65. The method of embodiment I64, comprising prior to (a), contacting the dephosphorylated oligonucleotide species with an agent comprising a phosphoryl transfer activity under conditions in which a 5′ phosphate is added to a 5′ end of oligonucleotide species.

I66. The method of any one of embodiments I1 to I65, wherein the target nucleic acids are obtained from a sample from a subject.

I67. The method of embodiment I66, wherein the subject is a human.

I68. The method of any one of embodiments I1 to I67, comprising prior to (a), separating the target nucleic acids according to fragment length.

I69. The method of embodiment I68, wherein target nucleic acids having fragment lengths of less than about 500 bp are combined with the plurality of oligonucleotide species.

I70. The method of embodiment I68, wherein target nucleic acids having fragment lengths of about 500 bp or more are combined with the plurality of oligonucleotide species.

I71. The method of any one of embodiments I1 to I70, wherein the oligonucleotide overhang comprises DNA nucleotides.

I72. The method of any one of embodiments I1 to I70, wherein the oligonucleotide overhang consists of DNA nucleotides.

I73. The method of any one of embodiments I1 to I70, wherein the oligonucleotide overhang comprises RNA nucleotides.

I74. The method of any one of embodiments I1 to I70, wherein the oligonucleotide overhang consists of RNA nucleotides.

I75. The method of embodiment I73 or I74, comprising contacting the hybridization products with an agent comprising an RNA ligase activity under conditions in which an end of a target nucleic acid is covalently linked to an end of the oligonucleotide to which the target nucleic acid is hybridized.

I76. The method of any one of embodiments I73 to I75, comprising contacting the hybridization products with an agent comprising an RNAse activity under conditions in which double-stranded RNA duplexes are digested.

J1. A composition comprising a plurality of oligonucleotide species, wherein:

(a) some or all of the oligonucleotides in the plurality of oligonucleotide species comprise (i) two strands and an overhang at a first end and two non-complementary strands at a second end, or (ii) one strand capable of forming a hairpin structure having a single-stranded loop and an overhang; wherein the overhang is capable of hybridizing to a target nucleic acid overhang, wherein each oligonucleotide species has a unique overhang sequence and length; and

(b) each oligonucleotide in the plurality of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the oligonucleotide overhang.

J2. The composition of embodiment J1, wherein each oligonucleotide in the plurality of oligonucleotide species consists of one strand capable of forming a hairpin structure having a single-stranded loop.

J3. The composition of embodiment J1 or J2, wherein the single stranded loop comprises a cleavage site.

J4. The composition of embodiment J3, wherein the cleavage site comprises one or more RNA nucleotides.

J5. The composition of embodiment J4, wherein the loop comprises two RNA nucleotides.

J6. The composition of embodiment J4, wherein the loop comprises three RNA nucleotides.

J7. The composition of embodiment J4, wherein the loop comprises four RNA nucleotides.

J8. The composition of any one of embodiments J4 to J7, wherein the loop comprises one or more ribonucleic acid (RNA) nucleotides chosen from adenine (A), cytosine (C), guanine (G), and uracil (U).

J9. The composition of any one of embodiments J4 to J8, wherein the RNA nucleotides comprise guanine (G).

J10. The composition of any one of embodiments J4 to J8, wherein the RNA nucleotides consist of guanine (G).

J11. The composition of any one of embodiments J4 to J8, wherein the RNA nucleotides comprise cytosine (C).

J12. The composition of any one of embodiments J4 to J8, wherein the RNA nucleotides consist of cytosine (C).

J13. The composition of any one of embodiments J4 to J8, wherein the RNA nucleotides comprise adenine (A).

J14. The composition of any one of embodiments J4 to J8, wherein the RNA nucleotides consist of adenine (A).

J15. The composition of any one of embodiments J4 to J8, wherein the RNA nucleotides consist of adenine (A), cytosine (C), and guanine (G).

J16. The composition of any one of embodiments J4 to J8, wherein the RNA nucleotides consist of adenine (A) and cytosine (C).

J17. The composition of any one of embodiments J4 to J8, wherein the RNA nucleotides consist of adenine (A) and guanine (G).

J18. The composition of any one of embodiments J4 to J8, wherein the RNA nucleotides consist of cytosine (C) and guanine (G).

J19. The composition of any one of embodiments J1 to J18, wherein the plurality of oligonucleotide species comprises oligonucleotides having a 5′ overhang and oligonucleotides having a 3′ overhang.

J20. The composition of any one of embodiments J1 to J19, wherein the plurality of oligonucleotide species comprises oligonucleotides having a 5′ overhang, oligonucleotides having a 3′ overhang, and oligonucleotides having no overhang.

J21. The composition of any one of embodiments J1 to J20, wherein the oligonucleotides that comprise an overhang comprise a single-stranded loop, a duplex portion, and a single-stranded overhang.

J22. The composition of any one of embodiments J1 to J21, wherein the oligonucleotides that comprise an overhang comprise an overhang at a first end, a duplex portion, and two non-complementary strands at a second end.

J23. The composition of any one of embodiments J1 to J22, wherein the oligonucleotide overhang comprises 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 or 20 nucleotides.

J24. The composition of any one of embodiments J1 to J23, wherein the oligonucleotides in the plurality of oligonucleotide species comprise oligonucleotide overhangs having different sequences for a particular overhang length.

J25. The composition of embodiment J24, wherein the oligonucleotides in the plurality of oligonucleotide species comprise all possible overhang sequence combinations for a particular overhang length.

J26. The composition of embodiment J25, wherein the oligonucleotides in the plurality of oligonucleotide species comprise all possible overhang sequence combinations for each overhang length.

J27. The composition of embodiment J24, J25 or J26, wherein the oligonucleotide overhang sequences are random.

J28. The composition of any one of embodiments J1 to J27, wherein the oligonucleotides that comprise no overhang comprise a single-stranded loop and a duplex portion.

J29. The composition of any one of embodiments J1 to J27, wherein the oligonucleotides that comprise no overhang comprise a blunt-ended first end, a duplex portion, and two non-complementary strands at a second end.

J30. The composition of any one of embodiments J1 to J29, wherein the oligonucleotide overhang identification sequence is specific to length of the oligonucleotide overhang.

J31. The composition of any one of embodiments J1 to J30, wherein the oligonucleotide overhang identification sequence is specific to length of the oligonucleotide overhang and is specific to one or more features of the oligonucleotide overhang chosen from (i) a 5′ overhang, (ii) a 3′ overhang, (iii) a particular sequence, (iv) a combination of (i) and (iii), or (v) a combination of (ii) and (iii).

J32. The composition of any one of embodiments J1 to J31, wherein an oligonucleotide species comprises no overhang and comprises an oligonucleotide overhang identification sequence specific to having no overhang.

J33. The composition of any one of embodiments J1 to J32, wherein the oligonucleotide overhang comprises DNA nucleotides.

J34. The composition of any one of embodiments J1 to J32, wherein the oligonucleotide overhang consists of DNA nucleotides.

J35. The composition of any one of embodiments J1 to J32, wherein the oligonucleotide overhang comprises RNA nucleotides.

J36. The composition of any one of embodiments J1 to J32, wherein the oligonucleotide overhang consists of RNA nucleotides.

K1. A kit, comprising:

the composition of any one of embodiments B1 to B32; and

instructions for using the composition to produce a nucleic acid library.

K2. The kit of embodiment K1, further comprising an agent comprising a phosphatase activity.

K3. The kit of embodiment K1 or K2, further comprising an agent comprising a phosphoryl transfer activity.

K4. The kit of any one of embodiments K1 to K3, further comprising an agent comprising a ligase activity.

K5. The kit of any one of embodiments K1 to K4, further comprising one or more cleavage agents.

K6. The kit of embodiment K5, wherein the one or more cleavage agents comprise a ribonuclease (RNAse).

K7. The kit of embodiment K6, wherein the RNAse is an endoribonuclease.

L1. A kit, comprising:

the composition of any one of embodiments D1 to D23; and

instructions for using the composition to modify nucleic acid ends.

L2. The kit of embodiment L1, further comprising an agent comprising a phosphatase activity.

L3. The kit of embodiment L1 or L2, further comprising an agent comprising a phosphoryl transfer activity.

L4. The kit of any one of embodiments L1 to L3, further comprising an agent comprising a ligase activity.

L5. The kit of any one of embodiments L1 to L4, further comprising one or more cleavage agents.

L6. The kit of any one of embodiments L1 to L5, wherein the one or more cleavage agents comprise an endonuclease.

L7. The kit of any one of embodiments L1 to L5, wherein the one or more cleavage agents comprise a DNA glycosylase.

L8. The kit of any one of embodiments L1 to L7, wherein the one or more cleavage agents comprise an endonuclease and a DNA glycosylase.

L9. The kit of embodiment L8, wherein the one or more cleavage agents comprise a mixture of uracil DNA glycosylase (UDG) and endonuclease VIII.

L10. The kit of any one of embodiments L1 to L9, further comprising a strand-displacing polymerase.

L11. The kit of any one of embodiments L1 to L10, further comprising modified nucleotides.

L12. The kit of embodiment L11, wherein the modified nucleotides comprise a nucleotide conjugated to a first member of a binding pair.

L13. The kit of embodiment L11 or L12, wherein the modified nucleotides comprise a nucleotide conjugated to biotin.

L14. The kit of embodiment L12 or L13, further comprising a second member of a binding pair conjugated to a solid support.

L15. The kit of embodiment L14, wherein the second member of a binding pair is streptavidin.

M1. A kit, comprising:

the composition of any one of embodiments F1 to F33; and

instructions for using the composition to modify nucleic acid ends.

M2. The kit of embodiment M1, further comprising an agent comprising a phosphatase activity.

M3. The kit of embodiment M1 or M2, further comprising an agent comprising a phosphoryl transfer activity.

M4. The kit of any one of embodiments M1 to M3, further comprising an agent comprising a ligase activity.

M5. The kit of any one of embodiments M1 to M4, further comprising a strand-displacing polymerase.

N1. A kit, comprising:

the composition of any one of embodiments H1 to H25; and

instructions for using the composition to modify nucleic acid ends.

N2. The kit of embodiment N1, further comprising an agent comprising a phosphatase activity.

N3. The kit of embodiment N1 or N2, further comprising an agent comprising a phosphoryl transfer activity.

N4. The kit of any one of embodiments N1 to N3, further comprising an agent comprising a ligase activity.

N5. The kit of any one of embodiments N1 to N4, further comprising an exonuclease.

N6. The kit of any one of embodiments N1 to N5, further comprising a shearing agent.

N7. The kit of any one of embodiments N1 to N6, further comprising a member of a binding pair conjugated to a solid support.

N8. The kit of embodiment N7, wherein the member of a binding pair is streptavidin.

O1. A kit comprising:

the composition of any one of embodiments J1 to J36; and

instructions for using the composition to modify nucleic acid ends.

O2. The kit of embodiment O1, further comprising an agent comprising a phosphatase activity.

O3. The kit of embodiment O1 or O2, further comprising an agent comprising a phosphoryl transfer activity.

O4. The kit of any one of embodiments O1 to O3, further comprising an agent comprising a ligase activity.

O5. The kit of any one of embodiments O1 to O4, further comprising a strand-displacing polymerase.

O6. The kit of any one of embodiments O1 to O5, further comprising one or more cleavage agents.

O7. The kit of embodiment O6, wherein the one or more cleavage agents comprise a ribonuclease (RNAse).

O8. The kit of embodiment O7, wherein the RNAse is an endoribonuclease.

P1. A method of assaying a population of nucleic acids, comprising:

assaying nucleic acid overhangs of a population of nucleic acids in a sample, thereby generating an overhang profile of the population; and

based on the overhang profile, determining one or more characteristics of the sample.

P2. The method of embodiment P1, wherein the assaying comprises contacting oligonucleotides to the population of nucleic acids.

P2.1 The method of embodiment P2, wherein some or all of the oligonucleotides comprise an overhang capable of hybridizing to a nucleic acid overhang, wherein each oligonucleotide species has a unique overhang sequence and length.

P3. The method of embodiment P2 or P2.1, wherein the oligonucleotides comprise overhang identification sequences.

P3.1 The method of embodiment P3, wherein each overhang identification sequence is specific to one or more features of the oligonucleotide overhang.

P3.2 The method of any one of embodiments P2 to P3.1, wherein some or all of the oligonucleotides comprise two strands, and an overhang at a first end and two non-complementary strands at a second end.

P3.3. The method of any one of embodiments P2 to P3.1, wherein some or all of the oligonucleotides comprise one strand capable of forming a hairpin structure having a single-stranded loop and an overhang.

P4. The method of any one of embodiments P1 to P3.3, wherein the one or more characteristics of the sample comprise a disease state.

P5. The method of embodiment P4, wherein the disease state comprises a cancer type or a cancer stage.

P5.1 The method of embodiment P5, wherein the cancer type is gastrointestinal cancer.

P6. The method of embodiment P4, wherein the disease state comprises a change in a rate or mode of cell death.

P7. The method of embodiment P6, wherein the change is associated with a particular cell type or organ type.

P8. The method of any one of embodiments P1 to P3.3, wherein the one or more characteristics of the sample comprise a microbiome profile.

P9. The method of any one of embodiments P1 to P3.3, wherein the one or more characteristics of the sample comprise radiation exposure.

P10. The method of any one of embodiments P1 to P3.3, wherein the one or more characteristics of the sample comprise nuclease activity.

P11. The method of embodiment P10, wherein the nuclease activity comprises nucleic acid-guided nuclease activity.

P12. The method of embodiment P11, wherein the nucleic acid-guided nuclease activity comprises CRISPR/Cas-system protein activity.

P13. The method of any one of embodiments P1 to P3.3, wherein the one or more characteristics of the sample comprise topoisomerase activity.

P14. The method of any one of embodiments P1 to P13, further comprising, prior to the assaying, inhibiting enzyme activity.

P15. The method of embodiment P14, wherein the enzyme activity comprises nuclease activity.

P16. The method of any one of embodiments P1 to P15, wherein the assaying comprises hybridization, thereby generating hybridization products.

P17. The method of any one of embodiments P1 to P16, wherein the assaying comprises sequencing the hybridization products, or amplification products thereof, by a sequencing process, thereby generating sequence reads.

P18. The method of embodiment P17, wherein the sequence reads comprise forward sequence reads and reverse sequence reads.

P19. The method of embodiment P18, comprising quantifying the sequence reads thereby generating a sequence read quantification, wherein the reverse sequence reads are quantified, and the forward sequence reads are excluded from the quantification.

P20. The method of embodiment P18 or P19, wherein the overhang profile is generated according to the reverse sequence reads.

P21. The method of embodiment P18, comprising analyzing overhang information associated with overhang identification sequences that indicate presence of an overhang for the reverse sequence reads, thereby generating an analysis.

P22. The method of embodiment P21, comprising omitting from the analysis overhang information associated with overhang identification sequences that indicate presence of an overhang for the forward sequence reads.

P23. The method of embodiment P21 or P22, comprising analyzing overhang information associated with overhang identification sequences that indicate no overhang for the forward sequence reads and the reverse sequence reads.
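
The read-selection logic of embodiments P21 to P23 can be sketched as follows. This is a minimal illustration under stated assumptions: each read is assumed to have already been reduced to a hypothetical `(direction, kind, length)` tuple derived from its overhang identification sequence, and the function name `quantify_overhangs` is invented for this example.

```python
from collections import Counter


def quantify_overhangs(reads):
    """Count overhang calls per (kind, length).

    `reads` is an iterable of (direction, kind, length) tuples, where
    direction is "forward" or "reverse", kind is "5prime", "3prime" or
    "blunt", and length is the overhang length (0 for blunt).
    """
    counts = Counter()
    for direction, kind, length in reads:
        if kind == "blunt":
            # No-overhang calls are analyzed for forward and reverse reads.
            counts[("blunt", 0)] += 1
        elif direction == "reverse":
            # Overhang calls are analyzed for reverse reads only.
            counts[(kind, length)] += 1
        # Forward reads that indicate an overhang are omitted from the analysis.
    return counts


example_reads = [
    ("forward", "5prime", 2),  # omitted
    ("reverse", "5prime", 2),  # counted
    ("reverse", "3prime", 1),  # counted
    ("forward", "blunt", 0),   # counted
    ("reverse", "blunt", 0),   # counted
]
profile = quantify_overhangs(example_reads)
```

With the example reads above, the resulting counts are one 5′ overhang of length 2, one 3′ overhang of length 1, and two blunt ends.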

P24. The method of any one of embodiments P1 to P23, wherein the overhang profile comprises one or more overhang features.

P25. The method of embodiment P24, wherein the one or more overhang features are chosen from one or more of overhang length, overhang type, dinucleotide count, trinucleotide count, tetranucleotide count, dinucleotide percent, trinucleotide percent, tetranucleotide percent, GC content, overhang percent, overhang count, percent overhang length, and genome coordinate.

P26. The method of embodiment P24, wherein the one or more overhang features comprise presence of a particular dinucleotide.

P27. The method of embodiment P26, wherein the one or more overhang features comprise presence of a CG dinucleotide.
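
Several of the overhang features named in embodiments P25 to P27 can be computed directly from an overhang sequence. The sketch below is illustrative only; the helper names and the choice to count overlapping k-mers are assumptions, not definitions taken from the embodiments.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers (k=2 for dinucleotides, k=3 for trinucleotides)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def gc_content(seq):
    """Fraction of G and C bases; 0.0 for an empty sequence."""
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def overhang_features(seq):
    """Summarize one overhang sequence as a small feature dictionary."""
    seq = seq.upper()
    dinucleotides = kmer_counts(seq, 2)
    return {
        "length": len(seq),
        "gc_content": gc_content(seq),
        "dinucleotide_count": dict(dinucleotides),
        "has_CG": dinucleotides["CG"] > 0,  # presence of a CG dinucleotide
    }


features = overhang_features("ACGCG")
```

For the overhang ACGCG, this yields a length of 5, a GC content of 0.8, dinucleotide counts of AC: 1, CG: 2 and GC: 1, and a positive CG dinucleotide flag.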

P28. The method of any one of embodiments P1 to P27, further comprising comparing the overhang profile to a reference overhang profile.

P29. The method of any one of embodiments P1 to P27, further comprising comparing the overhang profile to a second overhang profile of a second sample, wherein the second sample is from the same source as the sample at a different time point.
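
The embodiments do not specify how two overhang profiles are compared. As one possible approach, offered purely as an assumption, profiles expressed as counts per overhang length could be normalized to fractions and compared by total variation distance, as in the sketch below. The function names are hypothetical.

```python
def to_fractions(counts):
    """Convert raw counts per overhang length into fractions of the total."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()} if total else {}


def profile_distance(profile, reference):
    """Total variation distance between two profiles (0 = identical, 1 = disjoint)."""
    p, q = to_fractions(profile), to_fractions(reference)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical counts per overhang length for a sample and a reference.
sample_profile = {1: 30, 2: 10}
reference_profile = {1: 10, 2: 30}
distance = profile_distance(sample_profile, reference_profile)
```

Normalizing before comparison keeps the distance insensitive to sequencing depth, which matters when the two samples were sequenced to different read counts. The same function could be applied to two time points from one source, as in embodiment P29.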

P30. The method of any one of embodiments P1 to P29, wherein one or more steps are performed by a microprocessor.

P31. The method of any one of embodiments P1 to P29, comprising one or more features of any one of embodiments A1 to A68, C1 to C68, E1 to E64, G1 to G56, I1 to I76, Q1 to Q42, T1 to T58, and W1 to W59.

Q1. A method for modifying nucleic acid ends, comprising:

combining a nucleic acid composition comprising target nucleic acids and a plurality of oligonucleotide species, wherein:

(a) some or all of the oligonucleotides in the plurality of oligonucleotide species comprise at least one overhang comprising RNA nucleotides, wherein the overhang is capable of hybridizing to a target nucleic acid overhang, wherein each oligonucleotide species has a unique overhang sequence and length,

(b) some or all of the target nucleic acids comprise an overhang,

(c) each oligonucleotide in the plurality of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the oligonucleotide overhang, and

(d) the nucleic acid composition and the plurality of oligonucleotide species are combined under conditions in which oligonucleotide overhangs hybridize to target nucleic acid overhangs having a corresponding length, thereby forming hybridization products.
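
To illustrate the pairing between an RNA oligonucleotide overhang and a target overhang of corresponding length, the following sketch derives the fully complementary RNA sequence for a DNA target overhang. It assumes perfect Watson-Crick pairing, a DNA target, and sequences written 5′ to 3′; the names `RNA_PARTNER`, `complementary_rna_overhang` and `can_hybridize` are hypothetical, and real hybridization also depends on reaction conditions that are not modeled here.

```python
# DNA base in the target overhang -> RNA base that pairs with it (assumed Watson-Crick).
RNA_PARTNER = {"A": "U", "C": "G", "G": "C", "T": "A"}


def complementary_rna_overhang(target_overhang):
    """RNA overhang (5'->3') that pairs with a DNA target overhang (5'->3')."""
    return "".join(RNA_PARTNER[base] for base in reversed(target_overhang.upper()))


def can_hybridize(oligo_overhang, target_overhang):
    """True when the oligonucleotide overhang is the exact antiparallel complement."""
    return oligo_overhang.upper() == complementary_rna_overhang(target_overhang)


print(complementary_rna_overhang("GAT"))
print(can_hybridize("AU", "GAT"))
```

Because the comparison is exact, an oligonucleotide overhang of a different length than the target overhang never matches, which mirrors the corresponding-length condition in (d).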

Q2. The method of embodiment Q1, wherein the oligonucleotide overhang consists of RNA nucleotides.

Q3. The method of embodiment Q1 or Q2, wherein the plurality of oligonucleotide species comprises oligonucleotides having a 5′ overhang and oligonucleotides having a 3′ overhang.

Q4. The method of any one of embodiments Q1 to Q3, wherein the plurality of oligonucleotide species comprises oligonucleotides having a 5′ overhang, oligonucleotides having a 3′ overhang, and oligonucleotides having no overhang.

Q5. The method of any one of embodiments Q1 to Q4, wherein the oligonucleotide overhang comprises 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 or 20 nucleotides.

Q6. The method of any one of embodiments Q1 to Q5, wherein the oligonucleotides in the plurality of oligonucleotide species comprise oligonucleotide overhangs having different sequences for a particular overhang length.

Q7. The method of embodiment Q6, wherein the oligonucleotides in the plurality of oligonucleotide species comprise all possible overhang sequence combinations for a particular overhang length.

Q8. The method of embodiment Q7, wherein the oligonucleotides in the plurality of oligonucleotide species comprise all possible overhang sequence combinations for each overhang length.

Q9. The method of embodiment Q6, Q7 or Q8, wherein the oligonucleotide overhang sequences are random.

Q10. The method of any one of embodiments Q1 to Q9, wherein an end of an oligonucleotide is capable of being covalently linked to an end of a target nucleic acid to which the oligonucleotide is hybridized in the hybridization products.

Q11. The method of embodiment Q10, wherein the 3′ end of an oligonucleotide strand is capable of being covalently linked to the 5′ end of a strand in the target nucleic acid to which the oligonucleotide is hybridized in a hybridization product.

Q12. The method of any one of embodiments Q1 to Q11, wherein the oligonucleotide overhang identification sequence is specific to length of the oligonucleotide overhang.

Q13. The method of any one of embodiments Q1 to Q12, wherein the oligonucleotide overhang identification sequence is specific to length of the oligonucleotide overhang and is specific to one or more features of the oligonucleotide overhang chosen from (i) a 5′ overhang, (ii) a 3′ overhang, (iii) a particular sequence, (iv) a combination of (i) and (iii), or (v) a combination of (ii) and (iii).

Q14. The method of any one of embodiments Q1 to Q13, wherein some of the target nucleic acids comprise no overhang.

Q15. The method of any one of embodiments Q1 to Q14, wherein an oligonucleotide species comprises no overhang and comprises an oligonucleotide overhang identification sequence specific to having no overhang.

Q16. The method of any one of embodiments Q1 to Q15, wherein the target nucleic acids comprising an overhang comprise a duplex region and a single-stranded overhang.

Q17. The method of any one of embodiments Q1 to Q16, wherein each target nucleic acid comprising an overhang comprises an overhang at one end or an overhang at both ends.

Q18. The method of any one of embodiments Q1 to Q17, wherein an end, or both ends, of each target nucleic acid comprising an overhang independently comprises a 5′ overhang or a 3′ overhang.

Q19. The method of any one of embodiments Q1 to Q18, wherein the target nucleic acids comprise deoxyribonucleic acid (DNA) fragments.

Q20. The method of embodiment Q19, wherein the DNA fragments are obtained from cells.

Q21. The method of embodiment Q19 or Q20, wherein the DNA fragments comprise genomic DNA fragments.

Q22. The method of any one of embodiments Q1 to Q18, wherein the target nucleic acids comprise ribonucleic acid (RNA) fragments.

Q23. The method of embodiment Q22, wherein the RNA fragments are obtained from cells.

Q24. The method of any one of embodiments Q1 to Q23, wherein the target nucleic acids comprise cell-free nucleic acid fragments.

Q25. The method of any one of embodiments Q1 to Q24, wherein the target nucleic acids comprise circulating cell-free nucleic acid fragments.

Q26. The method of any one of embodiments Q1 to Q25, wherein the overhangs in target nucleic acids are native overhangs.

Q27. The method of any one of embodiments Q1 to Q26, wherein the overhangs in target nucleic acids are unmodified overhangs.

Q28. The method of any one of embodiments Q1 to Q27, wherein the target nucleic acids are not modified in length prior to combining with the plurality of oligonucleotide species.

Q29. The method of any one of embodiments Q1 to Q28, comprising preparing the nucleic acid composition prior to (a), by a process consisting essentially of isolating nucleic acid from a sample, thereby generating the nucleic acid composition.

Q30. The method of any one of embodiments Q1 to Q29, comprising exposing the hybridization products to conditions under which an end of the target nucleic acid is joined to an end of the oligonucleotide to which it is hybridized.

Q31. The method of embodiment Q30, comprising contacting the hybridization products with an agent comprising a ligase activity under conditions in which an end of a target nucleic acid is covalently linked to an end of the oligonucleotide to which the target nucleic acid is hybridized.

Q32. The method of embodiment Q30, comprising contacting the hybridization products with an agent comprising an RNA ligase activity under conditions in which an end of a target nucleic acid is covalently linked to an end of the oligonucleotide to which the target nucleic acid is hybridized.

Q33. The method of any one of embodiments Q1 to Q32, comprising prior to combining, contacting the target nucleic acid composition with an agent comprising a phosphatase activity under conditions in which target nucleic acids are dephosphorylated, thereby generating a dephosphorylated target nucleic acid composition.

Q34. The method of embodiment Q33, comprising prior to combining, contacting the dephosphorylated target nucleic acid composition with an agent comprising a phosphoryl transfer activity under conditions in which a 5′ phosphate is added to a 5′ end of target nucleic acids.

Q35. The method of any one of embodiments Q1 to Q34, comprising prior to combining, contacting the plurality of oligonucleotide species with an agent comprising a phosphatase activity under conditions in which the oligonucleotides are dephosphorylated, thereby generating a plurality of dephosphorylated oligonucleotide species.

Q36. The method of embodiment Q35, comprising prior to combining, contacting the dephosphorylated oligonucleotide species with an agent comprising a phosphoryl transfer activity under conditions in which a 5′ phosphate is added to a 5′ end of oligonucleotide species.

Q37. The method of any one of embodiments Q1 to Q36, wherein the target nucleic acids are obtained from a sample from a subject.

Q38. The method of embodiment Q37, wherein the subject is a human.

Q39. The method of any one of embodiments Q1 to Q38, comprising prior to combining, separating the target nucleic acids according to fragment length.

Q40. The method of embodiment Q39, wherein target nucleic acids having fragment lengths of less than about 500 bp are combined with the plurality of oligonucleotide species.

Q41. The method of embodiment Q39, wherein target nucleic acids having fragment lengths of about 500 bp or more are combined with the plurality of oligonucleotide species.

Q42. The method of any one of embodiments Q1 to Q41, comprising contacting the hybridization products with an agent comprising an RNAse activity under conditions in which double-stranded RNA duplexes are digested.

R1. A composition comprising a plurality of oligonucleotide species, wherein:

    (a) some or all of the oligonucleotides in the plurality of oligonucleotide species comprise at least one overhang comprising RNA nucleotides, wherein the overhang is capable of hybridizing to a target nucleic acid overhang, wherein each oligonucleotide species has a unique overhang sequence and length; and
    (b) each oligonucleotide in the plurality of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the oligonucleotide overhang.

R2. The composition of embodiment R1, wherein the oligonucleotide overhang consists of RNA nucleotides.

R3. The composition of embodiment R1 or R2, wherein the plurality of oligonucleotide species comprises oligonucleotides having a 5′ overhang and oligonucleotides having a 3′ overhang.

R4. The composition of any one of embodiments R1 to R3, wherein the plurality of oligonucleotide species comprises oligonucleotides having a 5′ overhang, oligonucleotides having a 3′ overhang, and oligonucleotides having no overhang.

R5. The composition of any one of embodiments R1 to R4, wherein the oligonucleotide overhang comprises 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 or 20 nucleotides.

R6. The composition of any one of embodiments R1 to R5, wherein the oligonucleotides in the plurality of oligonucleotide species comprise oligonucleotide overhangs having different sequences for a particular overhang length.

R7. The composition of embodiment R6, wherein the oligonucleotides in the plurality of oligonucleotide species comprise all possible overhang sequence combinations for a particular overhang length.

R8. The composition of embodiment R7, wherein the oligonucleotides in the plurality of oligonucleotide species comprise all possible overhang sequence combinations for each overhang length.
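
By way of illustration only, the phrase "all possible overhang sequence combinations" in embodiments R7 and R8 can be read as the set of 4^n sequences for an overhang of n nucleotides. The following sketch enumerates that set. The function names and the default RNA alphabet `ACGU` are assumptions made for the example and are not taken from the embodiments.

```python
from itertools import product


def all_overhang_sequences(length, alphabet="ACGU"):
    """Return every possible overhang sequence of the given length.

    The alphabet defaults to RNA bases (an assumption for this sketch);
    pass "ACGT" to enumerate DNA overhangs instead.
    """
    if length < 1:
        raise ValueError("overhang length must be at least 1")
    return ["".join(bases) for bases in product(alphabet, repeat=length)]


def overhang_pool(max_length, alphabet="ACGU"):
    """Map each overhang length 1..max_length to all sequences of that length."""
    return {
        n: all_overhang_sequences(n, alphabet)
        for n in range(1, max_length + 1)
    }
```

Because the count grows as 4^n, a pool covering every sequence at every length up to 20 nucleotides would be very large; a practical pool may cover only shorter lengths exhaustively.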

R9. The composition of embodiment R6, R7 or R8, wherein the oligonucleotide overhang sequences are random.

R10. The composition of any one of embodiments R1 to R9, wherein the oligonucleotide overhang identification sequence is specific to length of the oligonucleotide overhang.

R11. The composition of any one of embodiments R1 to R10, wherein the oligonucleotide overhang identification sequence is specific to length of the oligonucleotide overhang and is specific to one or more features of the oligonucleotide overhang chosen from (i) a 5′ overhang, (ii) a 3′ overhang, (iii) a particular sequence, (iv) a combination of (i) and (iii), or (v) a combination of (ii) and (iii).

R12. The composition of any one of embodiments R1 to R11, wherein an oligonucleotide species comprises no overhang and comprises an oligonucleotide overhang identification sequence specific to having no overhang.
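
As a non-limiting sketch of how an overhang identification sequence could encode overhang features (embodiments R10 to R12), the following example assigns one distinct identifier to each combination of overhang type (5′, 3′, or none) and overhang length, so that a sequence read carrying the identifier can be decoded back to those features. The identifier sequences, the labels `"5prime"`, `"3prime"` and `"none"`, and the function name are hypothetical; the simple enumeration used here is an assumption and does not reflect any particular identifier design.

```python
from itertools import product


def build_identifier_table(max_overhang_length, identifier_length=3):
    """Assign a distinct identifier sequence to each (overhang type, length).

    Returns a dict mapping identifier sequence -> (overhang type, length).
    Identifiers are simply enumerated in order, which is an illustrative
    assumption rather than a recommended design.
    """
    features = [("none", 0)]  # species with no overhang
    for overhang_type in ("5prime", "3prime"):
        for n in range(1, max_overhang_length + 1):
            features.append((overhang_type, n))

    identifiers = ("".join(p) for p in product("ACGT", repeat=identifier_length))
    table = dict(zip(identifiers, features))
    if len(table) < len(features):
        raise ValueError("identifier_length is too short for the number of features")
    return table
```

For example, `build_identifier_table(4)["AAA"]` decodes to `("none", 0)`. An identifier that is also specific to a particular overhang sequence, as in option (iii) of embodiment R11, would need a correspondingly larger table.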

S1. A kit comprising:

    the composition of any one of embodiments R1 to R12; and
    instructions for using the composition to modify nucleic acid ends.

S2. The kit of embodiment S1, further comprising an agent comprising a phosphatase activity.

S3. The kit of embodiment S1 or S2, further comprising an agent comprising a phosphoryl transfer activity.

S4. The kit of any one of embodiments S1 to S3, further comprising an agent comprising a ligase activity.

S5. The kit of any one of embodiments S1 to S4, further comprising an agent comprising an RNA ligase activity.

S6. The kit of any one of embodiments S1 to S5, further comprising a strand-displacing polymerase.

S7. The kit of any one of embodiments S1 to S6, further comprising one or more cleavage agents.

S8. The kit of embodiment S7, wherein the one or more cleavage agents comprise a ribonuclease (RNAse).

S9. The kit of embodiment S8, wherein the RNAse is an endoribonuclease.

S10. The kit of embodiment S8 or S9, wherein the RNAse is RNAse III.

T1. A method for producing a nucleic acid library, comprising:

    a) combining a nucleic acid composition comprising target nucleic acids and a first pool of oligonucleotide species, wherein:
        i) some or all of the target nucleic acids comprise an overhang,
        ii) some or all of the oligonucleotides in the first pool of oligonucleotide species comprise an overhang capable of hybridizing to a target nucleic acid overhang, wherein each oligonucleotide species has a unique overhang sequence and length,
        iii) each oligonucleotide in the first pool of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the oligonucleotide overhang,
        iv) each oligonucleotide in the first pool of oligonucleotide species comprises a first primer binding domain, and
        v) the nucleic acid composition and the first pool of oligonucleotide species are combined under conditions in which oligonucleotide overhangs hybridize to target nucleic acid overhangs having a corresponding length, thereby forming a first set of combined products;
    b) cleaving the first set of combined products, thereby forming cleaved products; and
    c) combining the cleaved products and a second pool of oligonucleotide species, wherein:
        i) each oligonucleotide in the second pool of oligonucleotide species comprises a first end and a second end,
        ii) each oligonucleotide in the second pool of oligonucleotide species comprises a second primer binding domain, wherein the first primer binding domain and the second primer binding domain are different, and
        iii) the cleaved products and the second pool of oligonucleotide species are combined under conditions in which the oligonucleotides in the second pool of oligonucleotide species attach at the first end to at least one end of the cleaved products, thereby forming a second set of combined products.

T1.1 The method of embodiment T1, further comprising:

    d) contacting, under amplification conditions, the second set of combined products with two or more amplification primer species, wherein a first primer species comprises a nucleotide sequence complementary to the first primer binding domain and a second primer species comprises a nucleotide sequence complementary to the second primer binding domain, thereby generating amplification products.
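
For illustration, a primer sequence "complementary to" a primer binding domain can be derived as the reverse complement of that domain when both are written 5′ to 3′. The sketch below does this for DNA sequences and also checks that the first and second primer binding domains differ, as embodiment T1 requires. The function names and any sequences passed to them are hypothetical and are not taken from the embodiments.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def design_primer_pair(first_binding_domain, second_binding_domain):
    """Return (first primer, second primer), each complementary to its domain.

    Raises ValueError when the two binding domains are identical, since the
    first and second primer binding domains are required to be different.
    """
    first = first_binding_domain.upper()
    second = second_binding_domain.upper()
    if first == second:
        raise ValueError("first and second primer binding domains must differ")
    return reverse_complement(first), reverse_complement(second)
```

Using two different binding domains means that only products carrying an oligonucleotide from the first pool at one end and an oligonucleotide from the second pool at the other end present both primer sites.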

T2. The method of embodiment T1 or T1.1, wherein the target nucleic acids comprise nucleic acid fragments larger than 500 bp.

T3. The method of embodiment T1 or T1.1, wherein the target nucleic acids comprise nucleic acid fragments larger than 1000 bp.

T4. The method of any one of embodiments T1 to T3, wherein (b) comprises contacting the first set of combined products under cleavage conditions with one or more cleavage agents capable of cleaving the combined products.

T5. The method of any one of embodiments T1 to T3, wherein (b) comprises mechanical shearing.

T6. The method of any one of embodiments T1 to T5, wherein some or all of the oligonucleotides in the first pool of oligonucleotide species comprise one or more modified nucleotides.

T7. The method of embodiment T6, wherein the one or more modified nucleotides are capable of blocking attachment to other oligonucleotides in the pool.

T8. The method of any one of embodiments T1 to T7, wherein some or all of the oligonucleotides in the second pool of oligonucleotide species comprise one or more modified nucleotides at the second end.

T9. The method of embodiment T8, wherein the one or more modified nucleotides are capable of blocking attachment of the second end of the oligonucleotide to the cleaved products.

T10. The method of any one of embodiments T1 to T9, further comprising sequencing the amplification products by a sequencing process.

T11. The method of embodiment T10, wherein the sequencing process generates short sequence reads.

T12. The method of any one of embodiments T1 to T11, wherein the first pool of oligonucleotide species comprises oligonucleotides having a 5′ overhang and oligonucleotides having a 3′ overhang.

T13. The method of any one of embodiments T1 to T12, wherein the first pool of oligonucleotide species comprises oligonucleotides having a 5′ overhang, oligonucleotides having a 3′ overhang, and oligonucleotides having no overhang.

T14. The method of embodiment T12 or T13, wherein the oligonucleotides that comprise an overhang comprise a duplex portion, and a single-stranded overhang.

T15. The method of any one of embodiments T12 to T14, wherein the oligonucleotide overhang comprises 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 or 20 nucleotides.

T16. The method of any one of embodiments T1 to T15, wherein the oligonucleotides in the first pool of oligonucleotide species comprise oligonucleotide overhangs having different sequences for a particular overhang length.

T17. The method of embodiment T16, wherein the oligonucleotides in the first pool of oligonucleotide species comprise all possible overhang sequence combinations for a particular overhang length.

T18. The method of embodiment T17, wherein the oligonucleotides in the first pool of oligonucleotide species comprise all possible overhang sequence combinations for each overhang length.

T19. The method of embodiment T16, T17 or T18, wherein the oligonucleotide overhang sequences are random.

T20. The method of any one of embodiments T1 to T19, wherein an end of an oligonucleotide in the first pool of oligonucleotide species is capable of being covalently linked to an end of a target nucleic acid to which the oligonucleotide is hybridized in the first set of combined products.

T21. The method of embodiment T20, wherein the 3′ end of an oligonucleotide strand in the first pool of oligonucleotide species is capable of being covalently linked to the 5′ end of a strand in the target nucleic acid to which the oligonucleotide is hybridized in the first set of combined products.

T22. The method of any one of embodiments T1 to T21, comprising after (b) repairing the ends of the cleaved products.

T23. The method of any one of embodiments T1 to T22, comprising after (b) adding one or more unpaired nucleotides to the ends of the cleaved products.

T24. The method of embodiment T23, wherein the oligonucleotides in the second pool of oligonucleotide species comprise one or more nucleotides at the first end that are complementary to the one or more nucleotides added to the cleaved products.

T25. The method of embodiment T24, wherein the oligonucleotides in the second pool of oligonucleotide species hybridize at the first end to at least one end of the cleaved products.

T26. The method of any one of embodiments T1 to T25, wherein an end of an oligonucleotide in the second pool of oligonucleotide species is capable of being covalently linked to an end of a cleaved product to which the oligonucleotide is attached in the second set of combined products.

T27. The method of embodiment T26, wherein the 3′ end of an oligonucleotide strand in the second pool of oligonucleotide species is capable of being covalently linked to the 5′ end of a strand in the cleaved product to which the oligonucleotide is attached in the second set of combined products.

T28. The method of any one of embodiments T1 to T27, wherein the oligonucleotides in the second pool of oligonucleotide species do not comprise an overhang capable of hybridizing to a native target nucleic acid overhang.

T28.1 The method of any one of embodiments T1 to T28, wherein the oligonucleotides in the second pool of oligonucleotide species do not comprise an oligonucleotide overhang identification sequence.

T29. The method of any one of embodiments T1 to T28.1, wherein the oligonucleotide overhang identification sequence on each oligonucleotide in the first pool of oligonucleotide species is specific to length of the oligonucleotide overhang.

T30. The method of any one of embodiments T1 to T29, wherein the oligonucleotide overhang identification sequence on each oligonucleotide in the first pool of oligonucleotide species is specific to length of the oligonucleotide overhang and is specific to one or more features of the oligonucleotide overhang chosen from (i) a 5′ overhang, (ii) a 3′ overhang, (iii) a particular sequence, (iv) a combination of (i) and (iii), or (v) a combination of (ii) and (iii).

T31. The method of any one of embodiments T1 to T30, wherein some of the target nucleic acids comprise no overhang.

T32. The method of any one of embodiments T1 to T31, wherein an oligonucleotide species in the first pool of oligonucleotide species comprises no overhang and comprises an oligonucleotide overhang identification sequence specific to having no overhang.

T33. The method of any one of embodiments T1 to T32, wherein the target nucleic acids comprising an overhang comprise a duplex region and a single-stranded overhang.

T34. The method of any one of embodiments T1 to T33, wherein each target nucleic acid comprising an overhang comprises an overhang at one end or an overhang at both ends.

T35. The method of any one of embodiments T1 to T34, wherein an end, or both ends, of each target nucleic acid comprising an overhang independently comprises a 5′ overhang or a 3′ overhang.

T36. The method of any one of embodiments T1 to T35, wherein the target nucleic acids comprise deoxyribonucleic acid (DNA) fragments.

T37. The method of embodiment T36, wherein the DNA fragments are obtained from cells.

T38. The method of embodiment T36 or T37, wherein the DNA fragments comprise genomic DNA fragments.

T39. The method of any one of embodiments T1 to T35, wherein the target nucleic acids comprise ribonucleic acid (RNA) fragments.

T40. The method of embodiment T39, wherein the RNA fragments are obtained from cells.

T41. The method of any one of embodiments T1 to T40, wherein the target nucleic acids comprise cell-free nucleic acid fragments.

T42. The method of any one of embodiments T1 to T41, wherein the target nucleic acids comprise circulating cell-free nucleic acid fragments.

T43. The method of any one of embodiments T1 to T42, wherein the overhangs in target nucleic acids are native overhangs.

T44. The method of any one of embodiments T1 to T43, wherein the overhangs in target nucleic acids are unmodified overhangs.

T45. The method of any one of embodiments T1 to T44, wherein the target nucleic acids are not modified in length prior to combining with the plurality of oligonucleotide species.

T46. The method of any one of embodiments T1 to T45, comprising preparing the nucleic acid composition prior to (a), by a process consisting essentially of isolating nucleic acid from a sample, thereby generating the nucleic acid composition.

T47. The method of any one of embodiments T1 to T46, comprising exposing the first set of combined products to conditions under which an end of the target nucleic acid is joined to an end of the oligonucleotide to which it is hybridized.

T48. The method of embodiment T47, comprising contacting the first set of combined products with an agent comprising a ligase activity under conditions in which an end of a target nucleic acid is covalently linked to an end of the oligonucleotide to which the target nucleic acid is hybridized.

T49. The method of any one of embodiments T1 to T48, comprising exposing the second set of combined products to conditions under which an end of the cleaved product is joined to an end of the oligonucleotide to which it is attached.

T50. The method of embodiment T49, comprising contacting the second set of combined products with an agent comprising a ligase activity under conditions in which an end of a cleaved product is covalently linked to an end of the oligonucleotide to which the target nucleic acid is attached.

T51. The method of any one of embodiments T1 to T50, comprising prior to (a), contacting the target nucleic acid composition with an agent comprising a phosphatase activity under conditions in which target nucleic acids are dephosphorylated, thereby generating a dephosphorylated target nucleic acid composition.

T52. The method of embodiment T51, comprising prior to (a), contacting the dephosphorylated target nucleic acid composition with an agent comprising a phosphoryl transfer activity under conditions in which a 5′ phosphate is added to a 5′ end of target nucleic acids.

T53. The method of any one of embodiments T1 to T52, comprising prior to (a), contacting the first pool of oligonucleotide species with an agent comprising a phosphatase activity under conditions in which the oligonucleotides are dephosphorylated, thereby generating a first pool of dephosphorylated oligonucleotide species.

T54. The method of any one of embodiments T1 to T53, comprising prior to (c), contacting the second pool of oligonucleotide species with an agent comprising a phosphoryl transfer activity under conditions in which a 5′ phosphate is added to a 5′ end of oligonucleotide species at the first end.

T55. The method of any one of embodiments T1 to T54, wherein the target nucleic acids are obtained from a sample from a subject.

T56. The method of embodiment T55, wherein the subject is a human.

T57. The method of any one of embodiments T1 to T56, comprising prior to (a), separating the target nucleic acids according to fragment length.

T58. The method of any one of embodiments T1 to T56, wherein the target nucleic acids are not separated by length prior to (a).

U1. A composition comprising:

    a) a first pool of oligonucleotide species, wherein:
        i) some or all of the oligonucleotides in the first pool of oligonucleotide species comprise an overhang capable of hybridizing to a target nucleic acid overhang, wherein each oligonucleotide species has a unique overhang sequence and length,
        ii) each oligonucleotide in the first pool of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the oligonucleotide overhang, and
        iii) each oligonucleotide in the first pool of oligonucleotide species comprises a first primer binding domain; and
    b) a second pool of oligonucleotide species, wherein:
        i) each oligonucleotide in the second pool of oligonucleotide species comprises a first end and a second end, and
        ii) each oligonucleotide in the second pool of oligonucleotide species comprises a second primer binding domain, wherein the first primer binding domain and the second primer binding domain are different.

U2. The composition of embodiment U1, wherein some or all of the oligonucleotides in the first pool of oligonucleotide species comprise one or more modified nucleotides.

U3. The composition of embodiment U2, wherein the one or more modified nucleotides are capable of blocking attachment to other oligonucleotides in the pool.

U4. The composition of any one of embodiments U1 to U3, wherein some or all of the oligonucleotides in the second pool of oligonucleotide species comprise one or more modified nucleotides at the second end.

U5. The composition of embodiment U4, wherein the one or more modified nucleotides are capable of blocking attachment of the second end of the oligonucleotide to cleaved target nucleic acids.

U6. The composition of any one of embodiments U1 to U5, wherein the first pool of oligonucleotide species comprises oligonucleotides having a 5′ overhang and oligonucleotides having a 3′ overhang.

U7. The composition of any one of embodiments U1 to U6, wherein the first pool of oligonucleotide species comprises oligonucleotides having a 5′ overhang, oligonucleotides having a 3′ overhang, and oligonucleotides having no overhang.

U8. The composition of embodiment U6 or U7, wherein the oligonucleotides that comprise an overhang comprise a duplex portion, and a single-stranded overhang.

U9. The composition of any one of embodiments U6 to U8, wherein the oligonucleotide overhang comprises 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 or 20 nucleotides.

U10. The composition of any one of embodiments U1 to U9, wherein the oligonucleotides in the first pool of oligonucleotide species comprise oligonucleotide overhangs having different sequences for a particular overhang length.

U11. The composition of embodiment U10, wherein the oligonucleotides in the first pool of oligonucleotide species comprise all possible overhang sequence combinations for a particular overhang length.

U12. The composition of embodiment U11, wherein the oligonucleotides in the first pool of oligonucleotide species comprise all possible overhang sequence combinations for each overhang length.

U13. The composition of embodiment U10, U11, or U12, wherein the oligonucleotide overhang sequences are random.

U14. The composition of any one of embodiments U1 to U13, wherein the oligonucleotides in the second pool of oligonucleotide species comprise one or more unpaired nucleotides at the first end.

U15. The composition of any one of embodiments U1 to U14, wherein the oligonucleotides in the second pool of oligonucleotide species do not comprise an overhang capable of hybridizing to a native target nucleic acid overhang.

U15.1 The composition of any one of embodiments U1 to U15, wherein the oligonucleotides in the second pool of oligonucleotide species do not comprise an oligonucleotide overhang identification sequence.

U16. The composition of any one of embodiments U1 to U15.1, wherein the oligonucleotide overhang identification sequence on each oligonucleotide in the first pool of oligonucleotide species is specific to length of the oligonucleotide overhang.

U17. The composition of any one of embodiments U1 to U16, wherein the oligonucleotide overhang identification sequence on each oligonucleotide in the first pool of oligonucleotide species is specific to length of the oligonucleotide overhang and is specific to one or more features of the oligonucleotide overhang chosen from (i) a 5′ overhang, (ii) a 3′ overhang, (iii) a particular sequence, (iv) a combination of (i) and (iii), or (v) a combination of (ii) and (iii).

U18. The composition of any one of embodiments U1 to U17, wherein an oligonucleotide species in the first pool of oligonucleotide species comprises no overhang and comprises an oligonucleotide overhang identification sequence specific to having no overhang.

V1. A kit, comprising:

    the composition of any one of embodiments U1 to U18; and
    instructions for using the composition to produce a nucleic acid library.

V2. The kit of embodiment V1, further comprising an agent comprising a phosphatase activity.

V3. The kit of embodiment V1 or V2, further comprising an agent comprising a phosphoryl transfer activity.

V4. The kit of any one of embodiments V1 to V3, further comprising an agent comprising a ligase activity.

V5. The kit of any one of embodiments V1 to V4, further comprising an agent comprising a cleavage activity.

V6. The kit of any one of embodiments V1 to V5, further comprising an agent comprising a polymerase activity.

V7. The kit of any one of embodiments V1 to V6, further comprising a first amplification primer species and a second amplification primer species, wherein the first primer species comprises a nucleotide sequence complementary to the first primer binding domain and the second primer species comprises a nucleotide sequence complementary to the second primer binding domain.

V8. The kit of any one of embodiments V1 to V7, further comprising one or more agents for performing nucleic acid amplification.

W1. A method for producing a nucleic acid library, comprising:

    a) combining a nucleic acid composition comprising target nucleic acids and a first pool of oligonucleotide species, wherein:
        i) some or all of the target nucleic acids comprise an overhang,
        ii) some or all of the oligonucleotides in the first pool of oligonucleotide species comprise an overhang at a first end capable of hybridizing to a target nucleic acid overhang, wherein each oligonucleotide species has a unique overhang sequence and length,
        iii) each oligonucleotide in the first pool of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the oligonucleotide overhang,
        iv) each oligonucleotide in the first pool of oligonucleotide species comprises a first primer binding domain, and
        v) the nucleic acid composition and the first pool of oligonucleotide species are combined under conditions in which oligonucleotide overhangs hybridize to target nucleic acid overhangs having a corresponding length, thereby forming a first set of combined products;
    b) cleaving the first set of combined products, thereby forming cleaved products; and
    c) combining the cleaved products and a second pool of oligonucleotide species, wherein:
        i) each oligonucleotide in the second pool of oligonucleotide species comprises a first strand and a second strand, wherein the first strand is shorter than the second strand, and wherein the first strand and the second strand are complementary at a first end of the oligonucleotide and the second strand comprises a single strand at a second end of the oligonucleotide,
        ii) each oligonucleotide in the second pool of oligonucleotide species comprises an oligonucleotide identification sequence specific to the second pool of oligonucleotide species,
        iii) each oligonucleotide in the second pool of oligonucleotide species comprises a second primer binding domain on the second strand, wherein the first primer binding domain and the second primer binding domain are different, and
        iv) the cleaved products and the second pool of oligonucleotide species are combined under conditions in which oligonucleotides in the second pool of oligonucleotide species attach to at least one end of the cleaved products, thereby forming a second set of combined products.

W1.1 The method of embodiment W1, further comprising:

    d) contacting, under amplification conditions, the second set of combined products with two or more amplification primer species, wherein a first primer species comprises a nucleotide sequence complementary to the first primer binding domain and a second primer species comprises a nucleotide sequence complementary to the second primer binding domain, thereby generating amplification products.

W2. The method of embodiment W1 or W1.1, wherein the target nucleic acids comprise nucleic acid fragments larger than 500 bp.

W3. The method of embodiment W1 or W1.1, wherein the target nucleic acids comprise nucleic acid fragments larger than 1000 bp.

W4. The method of any one of embodiments W1 to W3, wherein (b) comprises contacting the first set of combined products under cleavage conditions with one or more cleavage agents capable of cleaving the combined products.

W5. The method of any one of embodiments W1 to W3, wherein (b) comprises mechanical shearing.

W6. The method of any one of embodiments W1 to W5, wherein some or all of the oligonucleotides in the first pool of oligonucleotide species comprise one or more modified nucleotides at a second end.

W7. The method of embodiment W6, wherein the one or more modified nucleotides are capable of blocking attachment of the second end of the oligonucleotide to target nucleic acids.

W8. The method of any one of embodiments W1 to W7, wherein some or all of the oligonucleotides in the second pool of oligonucleotide species comprise one or more modified nucleotides at the second end.

W9. The method of embodiment W8, wherein the one or more modified nucleotides are capable of blocking attachment of the second end of the oligonucleotide to the cleaved products.

W10. The method of any one of embodiments W1 to W9, further comprising sequencing the amplification products by a sequencing process.

W11. The method of embodiment W10, wherein the sequencing process generates short sequence reads.

W12. The method of any one of embodiments W1 to W11, wherein the firstpool of oligonucleotide species comprises oligonucleotides having a 5′overhang and oligonucleotides having a 3′ overhang.

W13. The method of any one of embodiments W1 to W12, wherein the firstpool of oligonucleotide species comprises oligonucleotides having a 5′overhang, oligonucleotides having a 3′ overhang, and oligonucleotideshaving no overhang.

W14. The method of embodiment W12 or W13, wherein the oligonucleotidesthat comprise an overhang comprise a duplex portion, and asingle-stranded overhang.

W15. The method of any one of embodiments W12 to W14, wherein theoligonucleotides that comprise an overhang comprise (1) two strands andan overhang at a first end and two non-complementary strands at a secondend, or (2) one strand capable of forming a hairpin structure having asingle-stranded loop and an overhang.

W16. The method of any one of embodiments W12 to W15, wherein theoligonucleotide overhang comprises 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12, 13, 14, 15, 16, 17, 18, 19 or 20 nucleotides.

W17. The method of any one of embodiments W1 to W16, wherein theoligonucleotides in the first pool of oligonucleotide species compriseoligonucleotide overhangs having different sequences for a particularoverhang length.

W18. The method of embodiment W17, wherein the oligonucleotides in thefirst pool of oligonucleotide species comprise all possible overhangsequence combinations for a particular overhang length.

W19. The method of embodiment W18, wherein the oligonucleotides in thefirst pool of oligonucleotide species comprise all possible overhangsequence combinations for each overhang length.

W20. The method of embodiment W17, W18 or W19, wherein theoligonucleotide overhang sequences are random.
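
By way of non-limiting illustration, the phrase "all possible overhang sequence combinations" in embodiments W18 and W19 can be sketched as a simple enumeration. The sketch below assumes a four-letter DNA alphabet (A, C, G, T), under which an overhang of length n has 4^n possible sequences. The names BASES, all_overhangs and pool_size are hypothetical and are introduced only for this illustration.

```python
from itertools import product

# Assumption: overhangs are DNA and drawn from the four canonical bases.
BASES = "ACGT"


def all_overhangs(length):
    """Return every possible overhang sequence of the given length."""
    return ["".join(bases) for bases in product(BASES, repeat=length)]


def pool_size(max_length):
    """Count the distinct overhang sequences for lengths 1..max_length."""
    return sum(len(BASES) ** n for n in range(1, max_length + 1))


if __name__ == "__main__":
    print(all_overhangs(1))       # the four single-nucleotide overhangs
    print(len(all_overhangs(2)))  # 16 sequences of length 2
    print(pool_size(3))           # 4 + 16 + 64 = 84
```

The count applies to one overhang type; a pool containing both 5′ and 3′ overhangs of every sequence would hold twice as many overhang-bearing species.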

W21. The method of any one of embodiments W1 to W20, wherein an end of an oligonucleotide in the first pool of oligonucleotide species is capable of being covalently linked to an end of a target nucleic acid to which the oligonucleotide is hybridized in the first set of combined products.

W22. The method of embodiment W21, wherein the 3′ end of an oligonucleotide strand in the first pool of oligonucleotide species is capable of being covalently linked to the 5′ end of a strand in the target nucleic acid to which the oligonucleotide is hybridized in the first set of combined products.

W23. The method of any one of embodiments W1 to W22, comprising after (b) repairing the ends of the cleaved products.

W24. The method of any one of embodiments W1 to W23, comprising after (b) adding one or more unpaired nucleotides to the ends of the cleaved products.

W25. The method of embodiment W24, wherein the oligonucleotides in the second pool of oligonucleotide species comprise one or more nucleotides at the first end that are complementary to the one or more nucleotides added to the cleaved products.

W26. The method of embodiment W25, wherein the oligonucleotides in the second pool of oligonucleotide species hybridize at the first end to at least one end of the cleaved products.

W27. The method of any one of embodiments W1 to W26, wherein an end of an oligonucleotide in the second pool of oligonucleotide species is capable of being covalently linked to an end of a cleaved product to which the oligonucleotide is attached in the second set of combined products.

W28. The method of embodiment W27, wherein the 3′ end of an oligonucleotide strand in the second pool of oligonucleotide species is capable of being covalently linked to the 5′ end of a strand in the cleaved product to which the oligonucleotide is attached in the second set of combined products.

W29. The method of any one of embodiments W1 to W28, wherein the oligonucleotides in the second pool of oligonucleotide species do not comprise an overhang capable of hybridizing to a native target nucleic acid overhang.

W29.1 The method of any one of embodiments W1 to W29, wherein the oligonucleotides in the second pool of oligonucleotide species do not comprise an oligonucleotide overhang identification sequence.

W30. The method of any one of embodiments W1 to W29.1, wherein the oligonucleotide overhang identification sequence on each oligonucleotide in the first pool of oligonucleotide species is specific to length of the oligonucleotide overhang.

W31. The method of any one of embodiments W1 to W30, wherein the oligonucleotide overhang identification sequence on each oligonucleotide in the first pool of oligonucleotide species is specific to length of the oligonucleotide overhang and is specific to one or more features of the oligonucleotide overhang chosen from (i) a 5′ overhang, (ii) a 3′ overhang, (iii) a particular sequence, (iv) a combination of (i) and (iii), or (v) a combination of (ii) and (iii).

W32. The method of any one of embodiments W1 to W31, wherein some of the target nucleic acids comprise no overhang.

W33. The method of any one of embodiments W1 to W32, wherein an oligonucleotide species in the first pool of oligonucleotide species comprises no overhang and comprises an oligonucleotide overhang identification sequence specific to having no overhang.
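
As a non-limiting sketch of how an overhang identification sequence of the kind recited in embodiments W30, W31 and W33 could be decoded after sequencing, the example below maps each identification sequence to an overhang type and length. The identification sequences shown are hypothetical placeholders chosen for illustration and do not appear in this disclosure; the names BARCODE_TABLE and decode_overhang are likewise assumptions.

```python
# Hypothetical identification sequences (placeholders, not from the disclosure).
# Each one encodes the overhang type (5', 3' or none) and the overhang length.
BARCODE_TABLE = {
    "AAC": ("5'", 1),
    "AAG": ("5'", 2),
    "ACA": ("3'", 1),
    "ACC": ("3'", 2),
    "AGT": ("none", 0),  # species with no overhang carries its own identifier
}


def decode_overhang(identification_sequence):
    """Return (overhang_type, overhang_length), or None if unrecognized."""
    return BARCODE_TABLE.get(identification_sequence)


if __name__ == "__main__":
    for barcode in ("AAG", "AGT", "TTT"):
        print(barcode, decode_overhang(barcode))
```

A dictionary lookup is used here only because it makes the one-to-one correspondence between identification sequence and overhang features explicit; an unrecognized sequence returns None rather than a guessed overhang.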

W34. The method of any one of embodiments W1 to W33, wherein the target nucleic acids comprising an overhang comprise a duplex region and a single-stranded overhang.

W35. The method of any one of embodiments W1 to W34, wherein each target nucleic acid comprising an overhang comprises an overhang at one end or an overhang at both ends.

W36. The method of any one of embodiments W1 to W35, wherein an end, or both ends, of each target nucleic acid comprising an overhang independently comprises a 5′ overhang or a 3′ overhang.

W37. The method of any one of embodiments W1 to W36, wherein the target nucleic acids comprise deoxyribonucleic acid (DNA) fragments.

W38. The method of embodiment W37, wherein the DNA fragments are obtained from cells.

W39. The method of embodiment W37 or W38, wherein the DNA fragments comprise genomic DNA fragments.

W40. The method of any one of embodiments W1 to W36, wherein the target nucleic acids comprise ribonucleic acid (RNA) fragments.

W41. The method of embodiment W40, wherein the RNA fragments are obtained from cells.

W42. The method of any one of embodiments W1 to W41, wherein the target nucleic acids comprise cell-free nucleic acid fragments.

W43. The method of any one of embodiments W1 to W42, wherein the target nucleic acids comprise circulating cell-free nucleic acid fragments.

W44. The method of any one of embodiments W1 to W43, wherein the overhangs in target nucleic acids are native overhangs.

W45. The method of any one of embodiments W1 to W44, wherein the overhangs in target nucleic acids are unmodified overhangs.

W46. The method of any one of embodiments W1 to W45, wherein the target nucleic acids are not modified in length prior to combining with the plurality of oligonucleotide species.

W47. The method of any one of embodiments W1 to W46, comprising preparing the nucleic acid composition prior to (a), by a process consisting essentially of isolating nucleic acid from a sample, thereby generating the nucleic acid composition.

W48. The method of any one of embodiments W1 to W47, comprising exposing the first set of combined products to conditions under which an end of the target nucleic acid is joined to an end of the oligonucleotide to which it is hybridized.

W49. The method of embodiment W48, comprising contacting the first set of combined products with an agent comprising a ligase activity under conditions in which an end of a target nucleic acid is covalently linked to an end of the oligonucleotide to which the target nucleic acid is hybridized.

W50. The method of any one of embodiments W1 to W49, comprising exposing the second set of combined products to conditions under which an end of the cleaved product is joined to an end of the oligonucleotide to which it is attached.

W51. The method of embodiment W50, comprising contacting the second set of combined products with an agent comprising a ligase activity under conditions in which an end of a cleaved product is covalently linked to an end of the oligonucleotide to which the target nucleic acid is attached.

W52. The method of any one of embodiments W1 to W51, comprising prior to (a), contacting the target nucleic acid composition with an agent comprising a phosphatase activity under conditions in which target nucleic acids are dephosphorylated, thereby generating a dephosphorylated target nucleic acid composition.

W53. The method of embodiment W52, comprising prior to (a), contacting the dephosphorylated target nucleic acid composition with an agent comprising a phosphoryl transfer activity under conditions in which a 5′ phosphate is added to a 5′ end of target nucleic acids.

W54. The method of any one of embodiments W1 to W53, comprising prior to (a), contacting the first pool of oligonucleotide species with an agent comprising a phosphatase activity under conditions in which the oligonucleotides are dephosphorylated, thereby generating a first pool of dephosphorylated oligonucleotide species.

W55. The method of any one of embodiments W1 to W54, comprising prior to (c), contacting the second pool of oligonucleotide species with an agent comprising a phosphoryl transfer activity under conditions in which a 5′ phosphate is added to a 5′ end of the first strand.

W56. The method of any one of embodiments W1 to W55, wherein the target nucleic acids are obtained from a sample from a subject.

W57. The method of embodiment W56, wherein the subject is a human.

W58. The method of any one of embodiments W1 to W57, comprising prior to (a), separating the target nucleic acids according to fragment length.

W59. The method of any one of embodiments W1 to W57, wherein the target nucleic acids are not separated by length prior to (a).

X1. A composition comprising:

    a) a first pool of oligonucleotide species, wherein:
        i) some or all of the oligonucleotides in the first pool of oligonucleotide species comprise an overhang capable of hybridizing to a target nucleic acid overhang, wherein each oligonucleotide species has a unique overhang sequence and length,
        ii) each oligonucleotide in the first pool of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the oligonucleotide overhang, and
        iii) each oligonucleotide in the first pool of oligonucleotide species comprises a first primer binding domain; and
    b) a second pool of oligonucleotide species, wherein:
        i) each oligonucleotide in the second pool of oligonucleotide species comprises a first strand and a second strand, wherein the first strand is shorter than the second strand, and wherein the first strand and the second strand are complementary at a first end of the oligonucleotide and the second strand comprises a single strand at a second end of the oligonucleotide,
        ii) each oligonucleotide in the first pool of oligonucleotide species comprises an oligonucleotide identification sequence specific to the second pool of oligonucleotide species, and
        iii) each oligonucleotide in the second pool of oligonucleotide species comprises a second primer binding domain on the second strand, wherein the first primer binding domain and the second primer binding domain are different.

X2. The composition of embodiment X1, wherein some or all of the oligonucleotides in the first pool of oligonucleotide species comprise one or more modified nucleotides at a second end.

X3. The composition of embodiment X2, wherein the one or more modified nucleotides are capable of blocking attachment of the second end of the oligonucleotide to target nucleic acids.

X4. The composition of any one of embodiments X1 to X3, wherein some or all of the oligonucleotides in the second pool of oligonucleotide species comprise one or more modified nucleotides at the second end.

X5. The composition of embodiment X4, wherein the one or more modified nucleotides are capable of blocking attachment of the second end of the oligonucleotide to cleaved target nucleic acids.

X6. The composition of any one of embodiments X1 to X5, wherein the first pool of oligonucleotide species comprises oligonucleotides having a 5′ overhang and oligonucleotides having a 3′ overhang.

X7. The composition of any one of embodiments X1 to X6, wherein the first pool of oligonucleotide species comprises oligonucleotides having a 5′ overhang, oligonucleotides having a 3′ overhang, and oligonucleotides having no overhang.

X8. The composition of embodiment X6 or X7, wherein the oligonucleotides that comprise an overhang comprise a duplex portion, and a single-stranded overhang.

X9. The composition of any one of embodiments X6 to X8, wherein the oligonucleotides that comprise an overhang comprise (1) two strands and an overhang at a first end and two non-complementary strands at a second end, or (2) one strand capable of forming a hairpin structure having a single-stranded loop and an overhang.

X10. The composition of any one of embodiments X6 to X9, wherein the oligonucleotide overhang comprises 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 or 20 nucleotides.

X11. The composition of any one of embodiments X1 to X10, wherein the oligonucleotides in the first pool of oligonucleotide species comprise oligonucleotide overhangs having different sequences for a particular overhang length.

X12. The composition of embodiment X11, wherein the oligonucleotides in the first pool of oligonucleotide species comprise all possible overhang sequence combinations for a particular overhang length.

X13. The composition of embodiment X12, wherein the oligonucleotides in the first pool of oligonucleotide species comprise all possible overhang sequence combinations for each overhang length.

X14. The composition of embodiment X11, X12, or X13, wherein the oligonucleotide overhang sequences are random.

X15. The composition of any one of embodiments X1 to X14, wherein the oligonucleotides in the second pool of oligonucleotide species comprise one or more unpaired nucleotides at the first end.

X16. The composition of any one of embodiments X1 to X15, wherein the oligonucleotides in the second pool of oligonucleotide species do not comprise an overhang capable of hybridizing to a native target nucleic acid overhang.

X16.1 The composition of any one of embodiments X1 to X16, wherein the oligonucleotides in the second pool of oligonucleotide species do not comprise an oligonucleotide overhang identification sequence.

X17. The composition of any one of embodiments X1 to X16.1, wherein the oligonucleotide overhang identification sequence on each oligonucleotide in the first pool of oligonucleotide species is specific to length of the oligonucleotide overhang.

X18. The composition of any one of embodiments X1 to X17, wherein the oligonucleotide overhang identification sequence on each oligonucleotide in the first pool of oligonucleotide species is specific to length of the oligonucleotide overhang and is specific to one or more features of the oligonucleotide overhang chosen from (i) a 5′ overhang, (ii) a 3′ overhang, (iii) a particular sequence, (iv) a combination of (i) and (iii), or (v) a combination of (ii) and (iii).

X19. The composition of any one of embodiments X1 to X18, wherein an oligonucleotide species in the first pool of oligonucleotide species comprises no overhang and comprises an oligonucleotide overhang identification sequence specific to having no overhang.

Y1. A kit, comprising:

    the composition of any one of embodiments X1 to X19; and
    instructions for using the composition to produce a nucleic acid library.

Y2. The kit of embodiment Y1, further comprising an agent comprising a phosphatase activity.

Y3. The kit of embodiment Y1 or Y2, further comprising an agent comprising a phosphoryl transfer activity.

Y4. The kit of any one of embodiments Y1 to Y3, further comprising an agent comprising a ligase activity.

Y5. The kit of any one of embodiments Y1 to Y4, further comprising an agent comprising a cleavage activity.

Y6. The kit of any one of embodiments Y1 to Y5, further comprising an agent comprising a polymerase activity.

Y7. The kit of any one of embodiments Y1 to Y6, further comprising a first amplification primer species and a second amplification primer species, wherein the first primer species comprises a nucleotide sequence complementary to the first primer binding domain and the second primer species comprises a nucleotide sequence complementary to the second primer binding domain.

Y8. The kit of any one of embodiments Y1 to Y7, further comprising one or more agents for performing nucleic acid amplification.

Z1. A method for producing a nucleic acid library, comprising:

    a) contacting a nucleic acid composition comprising target nucleic acids with an agent comprising a phosphatase activity under conditions in which target nucleic acids are dephosphorylated, thereby generating dephosphorylated target nucleic acids, wherein some or all of the target nucleic acids comprise an overhang; and
    b) combining the dephosphorylated target nucleic acids and a plurality of oligonucleotide species, wherein:
        i) some or all of the oligonucleotides in the plurality of oligonucleotide species comprise an overhang capable of hybridizing to a target nucleic acid overhang, wherein each oligonucleotide species has a unique overhang sequence and length;
        ii) each oligonucleotide in the plurality of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the oligonucleotide overhang; and
        iii) the nucleic acid composition and the plurality of oligonucleotide species are combined under conditions in which oligonucleotide overhangs hybridize to target nucleic acid overhangs having a corresponding length, thereby forming hybridization products.

Z2. The method of embodiment Z1, comprising prior to (b), contacting the dephosphorylated target nucleic acids with an agent comprising a phosphoryl transfer activity under conditions in which a 5′ phosphate is added to a 5′ end of target nucleic acids.

Z3. The method of embodiment Z1 or Z2, comprising prior to (b), contacting the plurality of oligonucleotide species with an agent comprising a phosphatase activity under conditions in which the oligonucleotides are dephosphorylated, thereby generating a plurality of dephosphorylated oligonucleotide species.

Z4. The method of any one of embodiments Z1 to Z3, wherein some or all of the oligonucleotides in the plurality of oligonucleotide species comprise two strands, and the overhang at a first end and two non-complementary strands at a second end.

Z5. The method of any one of embodiments Z1 to Z3, wherein some or all of the oligonucleotides in the plurality of oligonucleotide species comprise one strand capable of forming a hairpin structure having a single-stranded loop and an overhang.

Z6. The method of any one of embodiments Z1 to Z5, comprising sequencing the hybridization products, or amplification products thereof, by a sequencing process, thereby generating sequence reads, wherein the sequence reads comprise forward sequence reads and reverse sequence reads.

Z7. The method of embodiment Z6, comprising quantifying the sequence reads thereby generating a sequence read quantification, wherein the reverse sequence reads are quantified, and the forward sequence reads are excluded from the quantification.

Z8. The method of embodiment Z6, comprising analyzing overhang information associated with overhang identification sequences that indicate presence of an overhang for the reverse sequence reads, thereby generating an analysis.

Z9. The method of embodiment Z8, comprising omitting from the analysis overhang information associated with overhang identification sequences that indicate presence of an overhang for the forward sequence reads.

Z10. The method of embodiment Z8 or Z9, comprising analyzing overhang information associated with overhang identification sequences that indicate no overhang for the forward sequence reads and the reverse sequence reads.
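
The read-selection logic of embodiments Z7 to Z10 can be sketched, in a non-limiting way, as a counting routine. The sketch assumes each sequence read has already been reduced to a tuple of read direction, overhang type and overhang length (for example, by decoding its overhang identification sequence); that tuple layout and the name quantify_overhangs are hypothetical and introduced only for illustration.

```python
from collections import Counter


def quantify_overhangs(reads):
    """Count overhang observations from (direction, overhang_type, length) tuples.

    direction is "forward" or "reverse"; overhang_type is "5'", "3'" or "none".
    """
    counts = Counter()
    for direction, overhang_type, length in reads:
        if overhang_type == "none":
            # No-overhang information is analyzed for forward and reverse reads.
            counts[("none", 0)] += 1
        elif direction == "reverse":
            # Overhang-bearing information is analyzed for reverse reads only.
            counts[(overhang_type, length)] += 1
        # Forward reads indicating an overhang are omitted from the analysis.
    return counts


if __name__ == "__main__":
    example_reads = [
        ("forward", "5'", 2),
        ("reverse", "5'", 2),
        ("reverse", "3'", 1),
        ("forward", "none", 0),
        ("reverse", "none", 0),
    ]
    print(quantify_overhangs(example_reads))
```

In the example input, the forward read carrying a 5′ overhang identifier is dropped, while both no-overhang reads are retained.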

Z11. The method of any one of embodiments Z1 to Z10, comprising one or more features of any one of embodiments A1 to A68, C1 to C68, E1 to E64, G1 to G56, I1 to I76, Q1 to Q42, T1 to T58, and W1 to W59.

A′1. A method for analyzing nucleic acid comprising:

    a) combining a nucleic acid composition comprising target nucleic acids and a plurality of oligonucleotide species, wherein:
        i) some or all of the target nucleic acids comprise an overhang;
        ii) some or all of the oligonucleotides in the plurality of oligonucleotide species comprise an overhang capable of hybridizing to a target nucleic acid overhang, wherein each oligonucleotide species has a unique overhang sequence and length;
        iii) each oligonucleotide in the plurality of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to one or more features of the oligonucleotide overhang; and
        iv) the nucleic acid composition and the plurality of oligonucleotide species are combined under conditions in which oligonucleotide overhangs hybridize to target nucleic acid overhangs having a corresponding length, thereby forming hybridization products;
    b) sequencing the hybridization products, or amplification products thereof, by a sequencing process, thereby generating sequence reads, wherein the sequence reads comprise forward sequence reads and reverse sequence reads; and
    c) analyzing overhang information associated with overhang identification sequences that indicate presence of an overhang for the reverse sequence reads, thereby generating an analysis, and omitting from the analysis overhang information associated with overhang identification sequences that indicate presence of an overhang for the forward sequence reads.

A′2. The method of embodiment A′1, wherein (c) comprises analyzing overhang information associated with overhang identification sequences that indicate no overhang for the forward sequence reads and the reverse sequence reads.

A′3. The method of embodiment A′1 or A′2, wherein some or all of the oligonucleotides in the plurality of oligonucleotide species comprise two strands, and the overhang at a first end and two non-complementary strands at a second end.

A′4. The method of embodiment A′1 or A′2, wherein some or all of the oligonucleotides in the plurality of oligonucleotide species comprise one strand capable of forming a hairpin structure having a single-stranded loop and an overhang.

A′5. The method of any one of embodiments A′1 to A′4, comprising prior to (a), contacting the target nucleic acids with an agent comprising a phosphatase activity under conditions in which target nucleic acids are dephosphorylated, thereby generating dephosphorylated target nucleic acids.

A′6. The method of embodiment A′5, comprising contacting the dephosphorylated target nucleic acids with an agent comprising a phosphoryl transfer activity under conditions in which a 5′ phosphate is added to a 5′ end of target nucleic acids.

A′7. The method of any one of embodiments A′1 to A′6, comprising prior to (a), contacting the plurality of oligonucleotide species with an agent comprising a phosphatase activity under conditions in which the oligonucleotides are dephosphorylated, thereby generating a plurality of dephosphorylated oligonucleotide species.

A′8. The method of any one of embodiments A′1 to A′7, wherein (c) is performed using a microprocessor.

A′9. The method of any one of embodiments A′1 to A′8, comprising one or more features of any one of embodiments A1 to A68, C1 to C68, E1 to E64, G1 to G56, I1 to I76, Q1 to Q42, T1 to T58, and W1 to W59.

The entirety of each patent, patent application, publication and document referenced herein hereby is incorporated by reference. Citation of the above patents, patent applications, publications and documents is not an admission that any of the foregoing is pertinent prior art, nor does it constitute any admission as to the contents or date of these publications or documents. Their citation is not an indication of a search for relevant disclosures. All statements regarding the date(s) or contents of the documents are based on available information and are not an admission as to their accuracy or correctness.

Modifications may be made to the foregoing without departing from the basic aspects of the technology. Although the technology has been described in substantial detail with reference to one or more specific embodiments, those of ordinary skill in the art will recognize that changes may be made to the embodiments specifically disclosed in this application, yet these modifications and improvements are within the scope and spirit of the technology.

The technology illustratively described herein suitably may be practiced in the absence of any element(s) not specifically disclosed herein. Thus, for example, in each instance herein any of the terms “comprising,” “consisting essentially of,” and “consisting of” may be replaced with either of the other two terms. The terms and expressions which have been employed are used as terms of description and not of limitation, and use of such terms and expressions does not exclude any equivalents of the features shown and described or portions thereof, and various modifications are possible within the scope of the technology claimed. The term “a” or “an” can refer to one of or a plurality of the elements it modifies (e.g., “a reagent” can mean one or more reagents) unless it is contextually clear that either one of the elements or more than one of the elements is described. The term “about” as used herein refers to a value within 10% of the underlying parameter (i.e., plus or minus 10%), and use of the term “about” at the beginning of a string of values modifies each of the values (i.e., “about 1, 2 and 3” refers to about 1, about 2 and about 3). For example, a weight of “about 100 grams” can include weights between 90 grams and 110 grams. Further, when a listing of values is described herein (e.g., about 50%, 60%, 70%, 80%, 85% or 86%) the listing includes all intermediate and fractional values thereof (e.g., 54%, 85.4%). Thus, it should be understood that although the present technology has been specifically disclosed by representative embodiments and optional features, modification and variation of the concepts herein disclosed may be resorted to by those skilled in the art, and such modifications and variations are considered within the scope of this technology.
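
As a non-limiting arithmetic illustration of the term “about” as defined above, the sketch below computes the plus or minus 10% range for a single value and applies it to each value in a string of values. The function names about_range and about_each are hypothetical and are used only for this illustration.

```python
def about_range(value, tolerance=0.10):
    """Return the (low, high) range meant by "about value" at plus or minus 10%."""
    delta = abs(value) * tolerance
    return (value - delta, value + delta)


def about_each(values, tolerance=0.10):
    """Apply "about" to every value in a listing, as in "about 1, 2 and 3"."""
    return [about_range(v, tolerance) for v in values]


if __name__ == "__main__":
    low, high = about_range(100)  # "about 100 grams"
    print(round(low, 6), round(high, 6))
    for low, high in about_each([1, 2, 3]):
        print(round(low, 6), round(high, 6))
```

Results are rounded only for display, since floating-point arithmetic may not represent each bound exactly.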

Certain embodiments of the technology are set forth in the claim(s) that follow(s).

1. (canceled)
2. A composition comprising a plurality of oligonucleotide species, wherein: a) each oligonucleotide in the plurality of oligonucleotide species comprises two strands; b) each oligonucleotide in the plurality of oligonucleotide species comprises an overhang at a first end, or some of the oligonucleotides in the plurality of oligonucleotide species comprise an overhang at a first end and some of the oligonucleotides in the plurality of oligonucleotide species comprise no overhang at a first end, wherein the overhang, when present, is capable of hybridizing to a target nucleic acid overhang, wherein each oligonucleotide species having an overhang has a unique overhang sequence and length; c) each oligonucleotide in the plurality of oligonucleotide species comprises at a second end (i) two non-complementary strands and (ii) one or more blocked nucleotides; and d) each oligonucleotide in the plurality of oligonucleotide species comprises an oligonucleotide overhang identification sequence specific to no overhang or specific to one or more features of the oligonucleotide overhang, wherein the one or more features comprise length of the overhang.
3. The composition of claim 2, wherein the plurality of oligonucleotide species comprises oligonucleotides having a 5′ overhang and oligonucleotides having a 3′ overhang.
4. The composition of claim 2, wherein the plurality of oligonucleotide species comprises oligonucleotides having a 5′ overhang, oligonucleotides having a 3′ overhang, and oligonucleotides having no overhang.
5. The composition of claim 2, wherein the oligonucleotides in the plurality of oligonucleotide species having overhangs comprise: i) oligonucleotide overhangs having different sequences for a particular overhang length; ii) all possible overhang sequence combinations for a particular overhang length; or iii) all possible overhang sequence combinations for each overhang length.
6. The composition of claim 5, wherein the oligonucleotide overhang sequences are random.
7. The composition of claim 2, wherein the oligonucleotide overhang identification sequence is specific to length of the oligonucleotide overhang and is specific to one or more features of the oligonucleotide overhang chosen from (i) a 5′ overhang, (ii) a 3′ overhang, (iii) a particular sequence, (iv) a combination of (i) and (iii), or (v) a combination of (ii) and (iii).
8. The composition of claim 2, wherein an oligonucleotide species comprises no overhang and comprises an oligonucleotide overhang identification sequence specific to having no overhang.
9. The composition of claim 2, wherein each of the two non-complementary strands at the second end of the oligonucleotide species comprises a primer binding domain, wherein one of the non-complementary strands comprises a first primer binding domain, and the other non-complementary strand comprises a second primer binding domain, wherein the first primer binding domain and second primer binding domain are different.
10. The composition of claim 2, wherein each oligonucleotide overhang identification sequence comprises a nucleic acid sequence that is unique for each oligonucleotide comprising an overhang of a particular length or no overhang.
11. The composition of claim 2, wherein each oligonucleotide overhang identification sequence comprises a nucleic acid sequence that is unique for each oligonucleotide comprising an overhang of a particular length or no overhang and a particular 5′ or 3′ overhang directionality, when an overhang is present.
12. A kit comprising: the composition of claim 2; and instructions for using the composition to produce a nucleic acid library.
13. The kit of claim 12, further comprising an agent comprising a phosphatase activity.
14. The kit of claim 12, further comprising an agent comprising a phosphoryl transfer activity.
15. The kit of claim 12, further comprising an agent comprising a ligase activity.
16. The kit of claim 12, further comprising an agent comprising a nick-sealing ligase activity.
17. The kit of claim 12, further comprising a strand-displacing polymerase.